# batched_energyentropy_acquisition_for_bayesian_optimization__db9c79d0.pdf

Batched Energy-Entropy acquisition for Bayesian Optimization

Felix Teufel12 Carsten Stahlhut1 Jesper Ferkinghoff-Borg1

1Machine Intelligence, Novo Nordisk A/S 2Department of Biology, University of Copenhagen {fegt,ctqs,jfgb}@novonordisk.com

Bayesian optimization (BO) is an attractive machine learning framework for performing sample-efficient global optimization of black-box functions. The optimization process is guided by an acquisition function that selects points to acquire in each round of BO. In batched BO, when multiple points are acquired in parallel, commonly used acquisition functions are often high-dimensional and intractable, leading to the use of sampling-based alternatives. We propose a statistical physics inspired acquisition function for BO with Gaussian processes that can natively handle batches. Batched Energy-Entropy acquisition for BO (BEEBO) enables tight control of the explore-exploit trade-off of the optimization process and generalizes to heteroskedastic black-box problems. We demonstrate the applicability of BEEBO on a range of problems, showing competitive performance to existing methods.

1 Introduction

Figure 1: q-UCB does not allow for controlling its explore-exploit trade-off with large batches. A GP surrogate (background) was initialized with 100 random points of the Ackley function. q-UCB was run with κ = 0.1 and κ = 100, BEEBO with T = 0.05 and T =50. Batch size Q=100.

Bayesian Optimization (BO) has since its inception [1, 2] made a profound contribution to the realm of global optimization of black-box functions through the usage of Bayesian statistics. For global optimization problems pursuing x = argmaxx X ftrue(x), BO has surfaced as a premier strategy for efficiently handling especially complex and costly unknown functions, ftrue(x). While BO is traditionally formulated in a single-point scenario, where individual points are queried and results are observed sequentially, there are situations where batched acquisition is needed. Such situations arise when ftrue(x) is expensive to evaluate in either time or cost, but can be effectively evaluated in parallel by dispatching multiple experiments, reducing the overall optimization time. This is often the case in e.g. drug discovery, materials design or hyperparameter tuning for deep models [3, 4, 5, 6, 7].

The realization that BO could be employed for the training of deep neural networks, as suggested by [3], sparked renewed research interest, with advancements encompassing a variety of areas, including the generalization to accommo-

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

date noisy inputs [8, 9], heteroskedastic noise [10, 11], multi-task problems [12], multi-fidelity [13], high-dimensional input spaces [14], and parallel methods with batch queries [15, 16]. Generally, these desired properties are addressed by customizing one of the two key components in BO, either the surrogate model or the acquisition function. The surrogate model f approximates the black-box function ftrue using the available data. In BO, the surrogate is formulated from a Bayesian perspective, allowing us to quantify the model s uncertainty when evaluating new points. Typically, the model of choice is a Gaussian Process (GP) [17]. The acquisition function is responsible for guiding the selection of new input point(s) to evaluate at each optimization step, utilizing the surrogate model to identify promising regions in the input domain and exploring the unknown function further.

Any acquisition process needs to trade off exploration (reducing uncertainty to learn a better surrogate model) against exploitation (selecting points with a high expected ftrue(x) based on the current surrogate). In this work, we are particularly interested in acquisition processes that make this trade-off controllable using a hyperparameter. Controllability can be a desirable property if e.g. domain knowledge relating to the difficulty of the optimization process and the quality of the surrogate model is available, or if the strategy needs to be adjusted depending on future experimental budgets. Similarly, it can be desirable to acquire multiple x with high ftrue(x) in a batch (as opposed to just finding the optimum x , with the remaining x being considered explorative). This is useful when optima identified in BO can be subject to constraints that are unknown at optimization time, but may render x intractable [18]. Such constraints arise when the ftrue explored in BO is a necessary simplification of the actual objective. Practical examples include e.g. the synthesizability of a material at larger scale, when BO experiments are performed at lab scale; or the in vivo activity of a molecule with BO experiments performed in vitro.

A wide range of batch mode acquisition functions has been proposed, with approaches often leveraging random sampling strategies or Monte Carlo (MC) integration, which can adversely affect controllability for large batches (Figure 1). In contrast, we here introduce BEEBO (Batched Energy Entropy acquisition for BO), a statistical physics inspired acquisition function for BO with GP surrogate models that natively generalizes to batched acquisition. BEEBO enables

Parallel gradient-based optimization of the inputs, without requiring sampling or Monte Carlo integrals. Tight control of the explore-exploit trade-off in batch mode using a single temperature hyperparameter. Risk-averse BO under heteroskedastic noise.

We demonstrate the application of BEEBO on a wide range of test problems, and investigate its behaviour under heteroskedastic noise.

2 Related works

Batch variants of traditional strategies Parallel acquisition in BO has seen a variety of approaches, often starting from established single-point acquisition functions like probability of improvement (PI), expected improvement (EI), knowledge gradient (KG) or upper confidence bound (UCB) [2, 19, 20, 21, 22]. Reformulating these to batch mode with Q query points, we obtain q-PI, q EI, and q-UCB [23, 24]. While the single-point specifications provide an analytical form and enable gradient-based optimization, batch expressions are more challenging and require different optimization strategies, typically involving greedy algorithms [25] or deriving an integral expression over multiple points.

For instance, in the popular EI acquisition function, a single point is selected by maximizing the expression a EI (x) = E[max (0, f (x) f t )] = R max (0, f (x) f t ) P (f|x) df. Here f t represents the best observed evaluation of ftrue so far. With a surrogate model in the form of a GP, the acquisition function depends only on the predictive mean and variance functions, µ (x) and C (x). Effectively, we need to evaluate the cumulative normal distribution, which quickly becomes intractable for large batch sizes and approximating the gradient of the q-EI acquisition function typically requires MC estimation [26, 27]. However, proper MC integration can be laborious and is sensitive to both the dimension of the problem and the choice of batch size Q. Specifically, MC methods face the curse of dimensionality problem when applied to high-dimensional integrals, as they require an exponentially increasing number of sample points to maintain accuracy, making them

computationally impractical for such tasks [28, 29]. Of particular interest is Wilson et al. [24], in which they adopt the reparameterization trick [30, 31] on acquisition functions integrals, enabling gradient based approaches to the optimization of PI, EI, and UCB. This demonstrates particular usefulness in modest to higher dimensions.

While EI trades off exploration and exploitation, users do not have a direct control over the balance. To alleviate this, Sobester et al. [32] proposed a weighted EI formulation. An alternative strategy with an explicit explore-exploit trade-off is offered by the UCB acquisition function, a UCB (x) = µ(x) + κ p

C(x), which directly expresses exploration and exploitation as two terms, traded off by the parameter κ. As we are particularly interested in enabling this direct user control, we focus our primary comparison on q-UCB in the main text, while a more extensive comparison with alternative methods can be found in Appendix B, both theoretically and experimentally.

Greedy strategies As mentioned, a popular approach for leveraging single-point acquisition functions is devising batch filling strategies that score candidate points sequentially. Kriging Believer (KB) [33] uses EI to select points and iteratively updates the GP by fantasizing an observation with the posterior mean. Likewise, GP-BUCB [34] uses fantasized observations to update p

C(x) at each step. Local penalization (LP) [35] introduces a penalization function that repulses selection away from already selected points. Contal et al. [36] propose selecting a single point using UCB and dedicating the remainder of the batch budget for exploration in a restricted region around the believed optimum. GLASSES [37] treats batch selection as a multi-step lookahead problem to overcome the myopia of only considering the immediate effect of selecting a point.

Entropy based strategies From an information theory perspective, BO can be interpreted as seeking to reduce uncertainty over the location of optima of the unknown function. This has given rise to entropy-based acquisition functions such as entropy search (ES) [38], predictive entropy search (PES) [39] and max-value entropy search (MES) [13, 40, 41]. MES is distinct in that it seeks to quantify the mutual information between the unknown ftrue(x ) and the observations y|D, rather than the location of x . General-purpose Information-Based Bayesian Optimizatio N (GIBBON) [42] provides an extension of MES that enables application to batched acquisition as well as other challenges such as multi-fidelity BO. GIBBON proposes a lower bound formulation for the intractable batch MES criterion, which is then optimized using greedy selection. Despite being formulated to handle a large degree of parallelism, Moss et al. [42] reported that GIBBON fails in practice for large batches with Q > 50. Potentially, this behaviour is a consequence of the accuracy of the lower bound approximation. A heuristic scaling of the batch diversity was proposed to improve performance with large batches. GIBBON may also be interpreted as a determinantal point process (DPP) [4, 43]. In Appendix B we provide a detailed discussion of the relationship of the BEEBO acquisition function to GIBBON and DPPs. Note that while we will also make use of the term entropy in BEEBO, the quantity is distinct from the ones leveraged by the aforementioned approaches in the sense that it does not relate to an unknown optimum.

Thompson sampling Given the challenges of generalizing acquisition functions to batch mode, Thompson sampling (TS), which was originally adopted from bandit problems [44, 45, 46, 47, 48], is a popular alternative strategy for guiding batched BO. While being an attractive approach in general, it has been demonstrated that default TS can become too exploitative, motivating the use of alternatives such as Bayesian Quadrature [49], or advanced strategies on top of TS that ensure diversity [18]. Eriksson et al. [50] demonstrate that overexploration also can be problematic in higher dimensions, and alleviate this using local trust regions in Tu RBO. Maintaining such regions with high precision discretization can be memory-expensive, as indicated by [51], who suggest using MCMC-BO with adaptive local optimization to address this by transitioning a set of candidate points towards more promising positions.

3 The BEEBO acquisition function

Assume ftrue : X R is some real output associated with the input and a set of data be given D = {(xi, yi)}N i=1 where yi R represent some noisy observations of ftrue(xi), say

yi = ftrue(xi) + ϵi (1)

with ϵi denoting the measurement noise. Let x = (x1, , x Q) XQ represent a collection of test points we wish to assign an acquisition value to. In keeping with the BO framework, we assume a given posterior probability distribution over the surrogate function f evaluated at x,

f(x) P(f | D, x) (2)

The lack of knowledge we have of the surrogate function at x is quantified by the differential entropy H:

H f | D, x = Z P f | D, x ln P f | D, x df (3)

This entropy can be contrasted with the expected entropy of the surrogate function if Q observations y = (y1, , y Q) were acquired at x, i.e. if the training data D would be augmented with D (y) = {(xq, yq)}Q q=1 to form the joint data set Daug(y) = D, D (y) . We refer to this entropy as Haug:

Haug f | D, x = Z P(y | D, x)H f | Daug(y) dy, (4)

where P(y | D, x) represent the posterior predictive distribution at x. The expected information gain, I(x), from acquiring observations at x is given by the expected reduction of entropy from this process: I(x) = H f | D, x Haug f | D, x (5)

We propose to represent the explore component of the acquisition function, a BEEBO, by I(x). The information gain I(x) is distinct from the quantities exploited by entropy search approaches, as it quantifies global uncertainty reduction, rather than estimating the information over an unknown x . The information gain is directly applicable to multivariate functions and to heteroskedastic settings where σ2 = σ2(x). Since large measurement uncertainties imply smaller information gain, a BEEBO exhibits risk-averse behaviour [11] by automatically prioritizing regions of small uncertainties from where more precise information of ftrue can be obtained, everything else being equal.

The exploit component of BEEBO relies on taking expectation values of a scalar function of the random variable f(x), E : RQ R, that summarizes the optimality properties of a given batch x. Natural choices would be the mean or the maximum of f(x). Of particular interest is expressing the optimality as a softmax-weighted sum over f(x), as this allows us to smoothly interpolate between the two regimes:

E(x) = E[ E(x)] Q = E

q=1 softmax(βf)qfq

where β is the softmax inverse temperature. At β = 0, we recover the mean. We scale the expectation with Q so that both I and E scale linearly with increasing batch size. While the mean provides a closed form expression for its expectation, this is not the case for the general softmax-weighted sum of a multivariate normal. Using Taylor expansion, we introduce an approximation of the expectation of the softmax-weighted sum that is fully differentiable and can be computed in closed form. A detailed derivation is provided in Appendix A. At β = 0, all Q points contribute equally to E(x), whereas at β > 0, points that do not compete for optimality are dynamically released. This effect can be quantified as the effective number of points via the entropy of the softmax weights. In the following, we will refer to the (exact) β = 0 limit as mean BEEBO, and the (approximated) general case as max BEEBO.

The BEEBO acquisition function then takes the form

a BEEBO(x) = E(x) + T I(x), (7)

where T sets the balance between exploitation (small T) and exploration (large T). As both E and I scale with the batch size Q, a given choice of T would set the explore-exploit balance in an approximately Q-independent manner. This acquisition function bears a strong similarity to the definition of (negative) free energies in statistical physics, where E and I correspond to respectively the thermodynamic energy and entropy of the system and T corresponds to the temperature.

3.1 BEEBO with Gaussian processes

Gaussian processes offer a particular convenient framework for BO, due to the availability of closedform expressions for the inference step [17]. Specifically

P(f | D, x) = N(f | µ(x), C(x))

µ(x) = K(x, x D) M 1 D y D C(x) = K(x, x) K(x, x D) M 1 D K(x D, x)

MD = K(x D, x D) + σ2(x D) (8)

where N( | µ, C) is the multivariate Gaussian distribution with mean µ, and covariance C, x D and y D are the x and y values of the acquired data, σ(x D) = diag σ2 1, , σ2 N is a diagonal matrix with the measurement uncertainties in the diagonal and K( , ) are matrices derived from the GP-kernel, k( , ), i.e. K(x, x )ij = k(xi, x j). It is worth noting that C(x) only depends on the input location of the test points x and the data points x D with their corresponding measurement uncertainties, σ2(x D), but not on the actual observations, y D. Consequently, the entropy of the posterior distribution

H f | D, x = Q

2 ln(2πe) + 1

2 ln det(C(x)) (9)

is independent of y D as well, with ln det denoting the log determinant. Similarly, the expected entropy of f if observations at x were acquired, simply reads

Haug f | D, x = Q

2 ln(2πe) + 1

2 ln det(Caug(x)), (10)

Caug(x) = K(x, x) K(x, xaug) M 1 aug K(xaug, x)

Maug = K(xaug, xaug) + σ2(xaug) (11)

and xaug = x Daug. The BEEBO acquisition function is then given by

a BEEBO(x) = E[ E(x)] Q + T I(x) (12)

where the expectation is either the mean, 1 Q PQ q=1 µq, or the closed form approximation of the softmax-weighted sum described in Appendix A, and

2 ln det(C(x)) 1

2 ln det(Caug(x)). (13)

Algorithm 1: mean BEEBO optimization

Input: model GP, initial batch points x, temperature T repeat

Calculate µ(x), C(x) from Equation 8 using GP E PQ q=1 µq GPaug fantasize(GP, x) Calculate Caug(x) from Equation 11 using GPaug I 1

2 ln det (C(x)) 1

2 ln det (Caug(x)) a E + T I x x + γ a until converged Output: optimized batch points x

All operations needed to compute the acquisition value a BEEBO(x) are analytical. Using automatic differentiation, the batch of points x can therefore be optimized with gradient-based methods, as laid out for mean BEEBO in Algorithm 1, with learning rate γ. In the pseudocode, GP denotes a trained GP model object that holds the training data and the kernel function. Using the kernel s learned amplitude A, we can relate BEEBO s T parameter to the κ of UCB. This allows us to configure BEEBO using a scaled temperature T that ensures both methods have equal gradients at isosurfaces, enabling the user to follow existing guidance and intuition from UCB to control the trade-off. A derivation is provided in Appendix B.1.

4 Experiments

Table 1: Overview of the test problems used in the experiments.

Function Dimension

Ackley 2, 10, 20, 50, 100 Shekel 4 Hartmann 6 Cosine 8 Rastrigin 2, 10, 20, 50, 100 Rosenbrock 2, 10, 20, 50, 100 Styblinski-Tang 2, 10, 20, 50, 100 Powell 10, 20, 50, 100 Embedded Hartmann 6 100

Test problems We benchmark acquisition function performance on a range of maximization test problems with varying dimensions (Table 1) available in Bo Torch [52]. Test problems that are evaluated on multiple dimensions support specifying the respective arbitrary d. As a high-dimensional problem with low inherent dimensionality, we embed the six-dimensional Hartmann function in d = 100 [50, 53, 54]. We additionally test on two robot control problems (robot arm pushing and rover trajectory planning) in Appendix D.3 [55, 56].

On each test problem, we perform 10 rounds of BO using q-UCB or BEEBO with a given explore-exploit parameter for direct comparison. We use the scaled temperature T (B.1) to ensure that both methods operate at the same trade-off. In round 0, we seed the surrogate GP with Q random points that were drawn so that each point has a minimum distance of 0.5 to the test problem s true optimum. We perform ten replicate runs for each problem and method, with replicate seeds controlled so that all methods start from the same Q random points in a replicate. As we evaluate performance in a fixed-round, fixed-Q optimization scenario, we set the explore hyperparameter to 0 in the last round (for max BEEBO, we also set the softmax β to 0). We use Q = 100 for all experiments, which is commonly understood to be a large batch size [50]. Additional results on small batch sizes (5, 10) are provided in Appendix D.2. All experiments use Bo Torch s default utilities for acquisition function optimization and GPy Torch [57] GP training (C.1).

Heteroskedastic noise We investigate performance when optimizing under heteroskedastic noise on the 2D Branin function with three global optima. To construct a heteroskedastic problem, we specify noise so that the noise level is maximal at optima 2 and 3, decaying exponentially with distance to any of the two noised optima (C.4). No noise maximum is added at optimum 1. Therefore, while all three optima share the same ftrue(x) (Figure A1), only optimum 1 is favorable in terms of heteroskedastic risk. We perform BO for ten rounds with β = 0.1 and Q = 10 using a heteroskedastic GP that learns surrogate models for both ftrue(x) and σ2(x). We report results over five replicate runs.

Metrics We report the mean best observed objective value after 10 rounds over the five replicates. As test problems have highly varying scales, we normalize the results on each test problem using min-max normalization. Typically, the minimum of a maximization problem is not known explicitly. We therefore set the minimum for normalization to the highest value observed among the random seed points. The maximum is given by the ftrue(x ) of the problem. The metric thus directly quantifies how much progress has been made to the true optimum from the random starting configuration on a 0-1 scale.

As we are not only interested in identifying a single x with good ftrue(x), we additionally quantify the overall quality of the final (exploitative) batch. We compute the batch instantaneous regret R = P

q<Q ftrue(x ) ftrue(xq) of the last, exploitative batch. To bring results on all test problems to a similar scale, we divide it by the batch instantaneous regret of a batch of Q random points on each problem. We refer to this metric as the relative batch instantaneous regret, Rrel = Rt=10/Rrandom.

For BO under heteroskedastic noise, we wish to quantify the preference of a given method for different optima. As optima share the same ftrue(x ), metrics operating on ftrue(x) are inherently unsuitable, and preference needs to be evaluated on x directly. For each acquired point xi, we compute the distances xi x j 2 to the J individual optima. We report the mean distance to each optimum over all points in a batch.

Table 2: Highest observed value after 10 rounds of BO with Q = 100. The best value at each κ is indicated in blue. BEEBO is configured with T = 1/2 κ. Full BO curves are provided in D.6, confidence intervals and statistical tests in Tables A2, A3 and A4

Problem d κ = 0.1 κ = 1.0 κ = 10.0

mean BEEBO max BEEBO q-UCB mean BEEBO max BEEBO q-UCB mean BEEBO max BEEBO q-UCB

Ackley 2 0.993 0.982 0.973 0.985 0.980 0.967 0.975 0.988 0.988 Levy 2 1.000 1.000 1.000 0.999 1.000 1.000 0.999 0.998 0.998 Rastrigin 2 0.981 0.993 0.951 0.989 0.983 0.983 0.983 0.993 0.933 Rosenbrock 2 0.976 0.982 0.949 0.956 0.979 0.943 0.955 0.938 0.962 Styblinski-Tang 2 0.961 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 Shekel 4 0.540 0.300 0.244 0.915 0.378 0.330 0.698 0.411 0.264 Hartmann 6 1.000 0.894 0.918 1.000 0.976 0.950 0.986 0.974 0.889 Cosine 8 1.000 0.999 0.934 0.999 0.972 0.924 0.619 0.895 0.621 Ackley 10 0.915 0.819 0.800 0.908 0.736 0.772 0.822 0.546 0.513 Levy 10 0.989 0.966 0.904 0.966 0.953 0.904 0.966 0.914 0.560 Powell 10 0.987 0.951 0.920 0.970 0.949 0.916 0.861 0.909 0.283 Rastrigin 10 0.463 0.558 0.420 0.536 0.573 0.522 0.595 0.590 0.311 Rosenbrock 10 0.994 0.991 0.966 0.991 0.986 0.971 0.904 0.975 0.645 Styblinski-Tang 10 0.837 0.822 0.309 0.835 0.638 0.492 0.289 0.229 0.049 Ackley 20 0.827 0.818 0.741 0.851 0.781 0.753 0.777 0.404 0.474 Levy 20 0.949 0.945 0.926 0.943 0.904 0.900 0.889 0.907 0.819 Powell 20 0.955 0.939 0.948 0.965 0.913 0.913 0.872 0.915 0.845 Rastrigin 20 0.399 0.484 0.423 0.473 0.472 0.480 0.508 0.522 0.401 Rosenbrock 20 0.993 0.992 0.973 0.995 0.983 0.982 0.907 0.933 0.924 Styblinski-Tang 20 0.737 0.667 0.203 0.689 0.394 0.561 0.330 0.274 0.034 Ackley 50 0.235 0.623 0.638 0.342 0.594 0.759 0.823 0.465 0.730 Levy 50 0.940 0.971 0.948 0.965 0.958 0.951 0.943 0.879 0.941 Powell 50 0.954 0.975 0.970 0.982 0.969 0.961 0.950 0.938 0.980 Rastrigin 50 0.322 0.472 0.431 0.476 0.470 0.397 0.432 0.439 0.481 Rosenbrock 50 0.971 0.976 0.968 0.984 0.981 0.986 0.983 0.962 0.981 Styblinski-Tang 50 0.584 0.509 0.312 0.675 0.342 0.694 0.393 0.325 0.356 Ackley 100 0.277 0.417 0.708 0.190 0.540 0.645 0.863 0.682 0.844 Emb. Hartmann 6 100 0.951 0.957 0.896 0.987 0.936 0.913 0.928 0.907 0.916 Levy 100 0.837 0.966 0.950 0.961 0.950 0.934 0.944 0.940 0.964 Powell 100 0.810 0.985 0.982 0.952 0.980 0.981 0.983 0.979 0.984 Rastrigin 100 0.497 0.441 0.446 0.401 0.455 0.442 0.459 0.455 0.443 Rosenbrock 100 0.822 0.953 0.972 0.971 0.976 0.969 0.980 0.970 0.978 Styblinski-Tang 100 0.537 0.423 0.308 0.474 0.353 0.532 0.401 0.296 0.278

Mean 0.795 0.811 0.758 0.828 0.790 0.801 0.788 0.744 0.678 Median 0.940 0.951 0.920 0.961 0.949 0.913 0.889 0.907 0.819

5.1 BO on test problems

We benchmark BEEBO against q-UCB at three rates of κ. Overall, we find that the simpler mean BEEBO variant outperforms max BEEBO in terms of mean performance on all but the lowest rate of κ (Table 2). As we consider the configuration with the lowest rate to be exploit-dominated, this can be understood as a consequence of max BEEBO effectively releasing non-contributing points for further exploration, with the low explore rate seeming sufficient to induce the necessary diversity.

While results on individual test problems vary, mean BEEBO shows improved performance over q-UCB especially in the medium dimension range up to 50. For d=100, we find mixed performance, with mean BEEBO gradually becoming more competitive with increasing κ. This is due to the fact that with increasing dimensionality, more exploration is beneficial for learning a good surrogate model before an actual BO process becomes effective. As we find that q-UCB inherently performs more random-like sampling at large Q, irrespective of κ, it benefits in such situations.

On average over all 33 performed experiments, mean BEEBO improves upon q-UCB for large batches at any of the three rates (Table A3). We additionally benchmarked BEEBO against other popular BO strategies without explore-exploit hyperparameters. Interestingly, we found that the Kriging Believer (KB) iterative heuristic [33] can perform very competitively for large batches when using Log EI [58] as the acquisition function (Table A1), especially on the two robot control problems (Appendix D.3), but can be slower to optimize than BEEBO (Table A9).

When evaluating Rrel in the ultimate round of BO, we find that BEEBO allows us to effectively acquire a batch with high ftrue(x), highlighting the controllability of the acquisition function (Table 3). The Rrel of q-UCB is only slightly better than the R of a random batch in many cases, even though the explore component was explicitly set to 0. We note that this is not due to the surrogate function being unsuitable - the results in Table 2 indicate that in most cases the location of ftrue(x ) is approximately

Table 3: Relative batch instantaneous regret Rrel in round 10 (κ = 0) with Q = 100. The best value at each κ is indicated in blue. BEEBO is configured with T = 1/2 κ. Lower means better.

Problem d κ = 0.1 κ = 1.0 κ = 10.0

mean BEEBO max BEEBO q-UCB mean BEEBO max BEEBO q-UCB mean BEEBO max BEEBO q-UCB

Ackley 2 0.292 0.259 1.006 0.268 0.245 0.999 0.257 0.165 1.002 Levy 2 0.134 0.114 1.236 0.092 0.102 1.046 0.102 0.111 1.114 Rastrigin 2 0.455 0.578 1.010 0.425 0.454 0.999 0.407 0.500 1.020 Rosenbrock 2 0.001 0.004 0.992 0.001 0.004 1.094 0.002 0.002 1.014 Styblinski-Tang 2 0.168 0.172 1.024 0.169 0.170 1.027 0.170 0.170 1.051 Shekel 4 0.810 0.776 0.993 0.688 0.730 0.995 0.688 0.695 0.988 Hartmann 6 0.060 0.229 0.968 0.078 0.086 0.971 0.100 0.098 0.862 Cosine 8 0.045 0.001 0.953 0.001 0.016 0.975 0.222 0.061 0.922 Ackley 10 0.478 0.338 0.931 0.314 0.345 0.943 0.253 0.452 0.950 Levy 10 0.041 0.030 1.188 0.023 0.048 1.011 0.261 0.103 1.111 Powell 10 0.016 0.027 1.037 0.009 0.067 1.101 0.067 0.151 1.215 Rastrigin 10 0.629 0.563 0.920 0.523 0.541 0.907 0.567 0.402 0.905 Rosenbrock 10 0.002 0.013 0.906 0.004 0.015 0.770 0.074 0.052 0.918 Styblinski-Tang 10 0.196 0.220 1.174 0.223 0.337 1.126 0.559 0.496 1.219 Ackley 20 0.629 0.219 0.945 0.282 0.292 0.950 0.226 0.586 0.917 Levy 20 0.128 0.241 0.839 0.063 0.113 0.914 0.140 0.182 1.056 Powell 20 0.093 0.081 0.809 0.010 0.074 0.689 0.028 0.110 0.870 Rastrigin 20 0.686 0.600 0.864 0.610 0.635 0.838 0.541 0.555 0.852 Rosenbrock 20 0.047 0.105 0.591 0.004 0.048 0.578 0.036 0.051 0.903 Styblinski-Tang 20 0.426 0.398 1.113 0.378 0.504 1.107 0.691 0.578 1.177 Ackley 50 0.895 0.464 0.949 0.738 0.606 0.947 0.177 0.530 0.874 Levy 50 0.055 0.029 0.611 0.033 0.085 0.681 0.051 0.268 0.892 Powell 50 0.018 0.021 0.542 0.014 0.078 0.499 0.018 0.064 0.785 Rastrigin 50 0.793 0.573 0.813 0.653 0.592 0.810 0.795 0.585 0.768 Rosenbrock 50 0.016 0.021 0.539 0.049 0.031 0.520 0.010 0.048 0.594 Styblinski-Tang 50 0.463 0.478 1.012 0.676 0.574 1.196 0.681 0.727 0.981 Ackley 100 0.718 0.636 0.948 0.900 0.466 0.935 0.137 0.321 0.863 Emb. Hartmann 6 100 0.068 0.144 0.573 0.035 0.086 0.863 0.175 0.172 0.692 Levy 100 0.119 0.031 0.615 0.044 0.164 0.716 0.042 0.056 0.586 Powell 100 0.094 0.011 0.465 0.027 0.041 0.493 0.013 0.018 0.524 Rastrigin 100 0.506 0.501 0.759 0.604 0.557 0.832 0.540 0.544 0.780 Rosenbrock 100 0.114 0.044 0.518 0.027 0.048 0.589 0.014 0.031 0.507 Styblinski-Tang 100 0.389 0.522 0.924 0.503 0.582 1.203 0.562 0.742 0.930

Mean 0.291 0.256 0.872 0.257 0.265 0.889 0.261 0.292 0.904 Median 0.134 0.219 0.931 0.092 0.164 0.943 0.175 0.172 0.917

known by round 10. Rather, we assume that this a consequence of the challenges of MC-based optimization of the acquisition function at large Q.

5.2 BO under heteroskedastic noise

We compare performance of mean BEEBO and q-UCB on the 3-optimum Branin function. Under heteroskedastic noise, we find that BEEBO preferentially optimizes towards the low-noise optimum 1 at the expense of the noisy optima 2 and 3 (Figure 2) and is therefore risk-averse. In the homoskedastic case, BEEBO does not exhibit this preference and optimizes for multiple optima. As expected, q-UCB, which only uses the model posterior variance p

C(x) instead of quantifying the actual information gain, does not display any preference for low-noise optima, showing similar behaviour under heteroskedastic and homoskedastic noise and remaining risk-neutral.

6 Discussion

We introduce BEEBO, an acquisition function for BO with GPs that can be optimized analytically and that scales natively to batched acquisition. By exploiting the independence of the information gain I(x) on measurements y when using GP surrogates, BEEBO models the interdependence of unknown points x in a batch and can optimize their positions jointly using gradient descent.

BEEBO enables full control of its explore-exploit trade-off using a hyperparameter T that directly balances two terms, akin to UCB. Unlike in the reparametrization-based q-methods, BEEBO s T has predictable behaviour also at increasing batch sizes.

The numerical complexity of BEEBO is dominated by the need to compute the inverse of Maug in Equation 11, which in a plain implementation scales as O((N + Q)3). However, this can be reduced to O(N 2Q); specifically, the Cholesky decomposition of Maug can be expressed as Q rank-1 updates of the pre-computed Cholesky decomposition of MD, where each update will have the complexity of O(N 2). The calculation of the energy, E, and the information gain, I, scales as O(N Q) and O(Q3), respectively, and are thus sub-dominant to the update needed for M 1 aug . For large N this

Figure 2: Mean distances of acquired points to the different optima of the Branin function. Under heteroskedastic noise, BEEBO is risk-averse and preferentially optimizes towards the low-noise optimum 1. Under homoskedastic noise, there is no preference. q-UCB does not adapt its behaviour to noise, remaining risk-neutral. The means and standard deviations over five replicates are shown.

approach may nevertheless become prohibitively slow. To overcome this limitation, methods for scalable GPs and fast predictive covariances such as LOVE [59] can be considered. The LOVE method allows a further reduction of the complexity of the Cholelsky update of Maug to O(N r Q) [60], where r is the rank of the LOVE approximation for MD, typically r N.

As opposed to q-UCB, BEEBO can take heteroskedastic noise into account when computing the information gain, and preferentially acquires more informative low-noise points. We note that when the noise function σ2(x) is unknown, and needs to be explored at the same time with ftrue(x), it is critical that the initial random points sufficiently capture the noise landscape well enough for the information gain component to be useful, as the uncertainty of the surrogate on σ2(x) is not used. This would require a fully Bayesian approach that integrates over the distribution of σ2(x). The problem does not arise if the heteroskedastic noise of an experiment is known beforehand by e.g. instrument calibration. While not the focus of this work, we note that using the information gain could also be beneficial in sequential single-sample BO on heteroskedastic problems.

In our experiments, we have focused on maintaining consistent explore-exploit ratios throughout the optimization rounds to ensure an equitable experimental comparison with q-UCB and demonstrate the effect of the hyperparameter choice. However, a more dynamic approach involving variable ratios could be more effective in real-world applications with a predetermined number of rounds [61]. Adopting a fully Bayesian perspective, one could consider the temperature hyperparameter T as a random variable. This opens up an intriguing avenue for BEEBO, where T could be drawn from a prior distribution that e.g. varies across optimization rounds, depending on the specific application. By tailoring this distribution, one could encourage a high level of exploration in the initial rounds, gradually transitioning towards a more exploitation-focused approach towards the end. In the presented experiments, we have implemented this as a strict constraint, maintaining a fixed T until the final round, at which point we shift to full exploitation, i.e., T = 0.

While not explored in this work, we note that the BEEBO expression could naturally be extended to multi-objective optimization problems by capitalizing on GPs that handle vector-valued functions, such as multi-task GPs [12, 62]. Through e.g. the usage of the intrinsic model of coregionalization, we obtain a covariance function k, and thereby a covariance matrix C(x), over all input-task pairs. As the multi-task covariance matrix is jointly Gaussian, the expression of the information gain remains unchanged and can be computed like in the single-task case. The energy E(x) becomes vector-valued, providing an energy term for each of the tasks. This would allow for the introduction of task-specific

weights in the acquisition function. As the extension only affects the surrogate model, the scaling remains cubic in the number of input-task observations.

Beyond GPs, BEEBO could be generalized to work with any probabilistic model. However, GPs are unique in that Haug is available in closed form and can be used to compute I(x) analytically, without solving the integral over y in Equation 4. Other models may require approximations and sampling-based approaches for computing the information gain.

Availability

A Bo Torch implementation of BEEBO is available at

https://github.com/novonordisk-research/BEE-BO.

Acknowledgments and Disclosure of Funding

We thank Christoffer Riis, Jan C. Refsgaard and Kilian W. Conde-Frieboes for helpful discussions. FT was funded in part by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606).

[1] H. J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering, 86(1):97 106, 03 1964.

[2] Jonas Moˇckus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1 7, 1974, pages 400 404. Springer, 1975.

[3] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in neural information processing systems, 25, 2012.

[4] Jesse Dodge, Kevin Jamieson, and Noah A Smith. Open loop hyperparameter optimization and determinantal point processes. ar Xiv preprint ar Xiv:1706.01566, 2017.

[5] Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained bayesian optimization for automatic chemical design using variational autoencoders. Chemical science, 11(2):577 586, 2020.

[6] Ji Won Park, Samuel Stanton, Saeed Saremi, Andrew Watkins, Henri Dwyer, Vladimir Gligorijevic, Richard Bonneau, Stephen Ra, and Kyunghyun Cho. Property DAG: Multi-objective Bayesian optimization of partially ordered, mixed-variable properties for biological sequence design. ar Xiv, 2022.

[7] Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. Accelerating Bayesian optimization for biological sequence design with denoising autoencoders. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20459 20478. PMLR, 17 23 Jul 2022.

[8] Peter I Frazier. A Tutorial on Bayesian Optimization. ar Xiv, 2018.

[9] Samuel Daulton, Sait Cakmak, Maximilian Balandat, Michael A Osborne, Enlu Zhou, and Eytan Bakshy. Robust Multi-Objective Bayesian Optimization Under Input Noise. ar Xiv, 2022.

[10] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained bayesian optimization with noisy experiments, 2018.

[11] Anastasia Makarova, Ilnura Usmanova, Ilija Bogunovic, and Andreas Krause. Risk-averse heteroscedastic bayesian optimization. Advances in Neural Information Processing Systems, 34:17235 17245, 2021.

[12] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. Advances in neural information processing systems, 26, 2013.

[13] Shion Takeno, Hitoshi Fukuoka, Yuhki Tsukada, Toshiyuki Koyama, Motoki Shiga, Ichiro Takeuchi, and Masayuki Karasuyama. Multi-fidelity bayesian optimization with max-value entropy search and its parallelization. In International Conference on Machine Learning, pages 9334 9345. PMLR, 2020.

[14] Riccardo Moriconi, Marc P Deisenroth, and K S Sesh Kumar. High-dimensional Bayesian optimization using low-dimensional feature spaces. ar Xiv, 2019.

[15] Javad Azimi, Alan Fern, and Xiaoli Fern. Batch bayesian optimization via simulation matching. Advances in Neural Information Processing Systems, 23, 2010.

[16] Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched gaussian process bandit optimization via determinantal point processes. Advances in neural information processing systems, 29, 2016.

[17] Christopher Williams and Carl Rasmussen. Gaussian processes for regression. Advances in neural information processing systems, 8, 1995.

[18] Natalie Maus, Kaiwen Wu, David Eriksson, and Jacob Gardner. Discovering many diverse solutions with bayesian optimization. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 1779 1798. PMLR, 25 27 Apr 2023.

[19] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13:455 492, 1998.

[20] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002.

[21] Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599 613, 2009.

[22] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009.

[23] James T Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. The reparameterization trick for acquisition functions. ar Xiv, 2017.

[24] James T Wilson, Frank Hutter, and Marc Peter Deisenroth. Maximizing acquisition functions for Bayesian optimization. ar Xiv, 2018.

[25] Amar Shah and Zoubin Ghahramani. Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions. Advances in neural information processing systems, 28, 2015.

[26] David Ginsbourger, Janis Janusevskis, and Rodolphe Le Riche. Dealing with asynchronicity in parallel Gaussian process based global optimization. Ph D thesis, Mines Saint-Etienne, 2011.

[27] Enrico Crovini, Simon L Cotter, Konstantinos Zygalakis, and Andrew B Duncan. Batch bayesian optimization via particle gradient flows. ar Xiv preprint ar Xiv:2209.04722, 2022.

[28] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional integration: the quasi-monte carlo way. Acta Numerica, 22:133 288, 2013.

[29] R. E. Caflisch. Monte carlo and quasi-monte carlo methods. Acta Numerica, 7:1 49, 1998.

[30] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28, 2015.

[31] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278 1286. PMLR, 2014.

[32] András Sóbester, Stephen J Leary, and Andy J Keane. On the design of optimization strategies based on global response surface approximation models. Journal of Global Optimization, 33:31 59, 2005.

[33] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging Is Well-Suited to Parallelize Optimization, pages 131 162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[34] Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15(119):4053 4103, 2014.

[35] Javier Gonzalez, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 648 657, Cadiz, Spain, 09 11 May 2016. PMLR.

[36] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný, editors, Machine Learning and Knowledge Discovery in Databases, pages 225 240, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[37] Javier Gonzalez, Michael Osborne, and Neil Lawrence. Glasses: Relieving the myopia of bayesian optimisation. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 790 799, Cadiz, Spain, 09 11 May 2016. PMLR.

[38] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(57):1809 1837, 2012.

[39] José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Ar Xiv, abs/1406.2541, 2014.

[40] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In International Conference on Machine Learning, pages 3627 3635. PMLR, 2017.

[41] Henry B Moss, David S Leslie, and Paul Rayson. Mumbo: Multi-task max-value bayesian optimization. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14 18, 2020, Proceedings, Part III, pages 447 462. Springer, 2021.

[42] Henry B. Moss, David S. Leslie, Javier I. González, and Paul Rayson. Gibbon: General-purpose information-based bayesian optimisation. Ar Xiv, abs/2102.03324, 2021.

[43] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Found. Trends Mach. Learn., 5:123 286, 2012.

[44] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933.

[45] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 844 853. PMLR, 06 11 Aug 2017.

[46] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133 142. PMLR, 2018.

[47] Vahab Mirrokni, Mohammad Shadravan, et al. Parallelizing thompson sampling. In Advances in Neural Information Processing Systems, 2021.

[48] Amin Karbasi, Vahab Mirrokni, and Mohammad Shadravan. Parallelizing thompson sampling. Advances in Neural Information Processing Systems, 34:10535 10548, 2021.

[49] Masaki Adachi, Satoshi Hayakawa, Saad Hamid, Martin Jørgensen, Harald Oberhauser, and Micheal A Osborne. Sober: Scalable batch bayesian optimization and quadrature using recombination constraints. ar Xiv preprint ar Xiv:2301.11832, 2023.

[50] David Eriksson and Martin Jankowiak. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Cassio de Campos and Marloes H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pages 493 503. PMLR, 27 30 Jul 2021.

[51] Zeji Yi, Yunyue Wei, Chu Xin Cheng, Kaibo He, and Yanan Sui. Improving sample efficiency of high dimensional bayesian optimization with mcmc. ar Xiv preprint ar Xiv:2401.02650, 2024.

[52] Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. Botorch: a framework for efficient monte-carlo bayesian optimization. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 20, Red Hook, NY, USA, 2020. Curran Associates Inc.

[53] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando De Freitas. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1):361 387, jan 2016.

[54] Leonard Papenmeier, Luigi Nardi, and Matthias Poloczek. Increasing the scope as you learn: Adaptive bayesian optimization in nested subspaces. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[55] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale bayesian optimization in high-dimensional spaces. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 745 754. PMLR, 09 11 Apr 2018.

[56] Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional bayesian optimization via structural kernel learning. In International Conference on Machine Learning, 2017.

[57] Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. Gpytorch: blackbox matrix-matrix gaussian process inference with gpu acceleration. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, page 7587 7597, Red Hook, NY, USA, 2018. Curran Associates Inc.

[58] Sebastian Ament, Sam Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Unexpected improvements to expected improvement for bayesian optimization. In Thirtyseventh Conference on Neural Information Processing Systems, 2023.

[59] Geoff Pleiss, Jacob Gardner, Kilian Weinberger, and Andrew Gordon Wilson. Constanttime predictive distributions for Gaussian processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4114 4123. PMLR, 10 15 Jul 2018.

[60] Shali Jiang, Daniel R. Jiang, Maximilian Balandat, Brian Karrer, Jacob R. Gardner, and R. Garnett. Efficient nonmyopic bayesian optimization via one-shot multi-step trees. Ar Xiv, abs/2006.15779, 2020.

[61] George De Ath, Richard M. Everson, Alma A. M. Rahat, and Jonathan E. Fieldsend. Greed is good: Exploration and exploitation trade-offs in bayesian optimisation. ACM Trans. Evol. Learn. Optim., 1(1), apr 2021.

[62] Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

[63] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(8):235 284, 2008.

[64] Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, and Ghislain Durif. Kernel operations on the gpu, with autodiff, without memory overflows. Journal of Machine Learning Research, 22(74):1 6, 2021.

Table of Contents

A Approximating the expectation of the softmax weighted sum 15 A.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.3 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.4 Number of effective points . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

B Relationship to other acquisition strategies 18 B.1 Relationship of BEEBO T and UCB κ hyperparameters . . . . . . . . . . . . . 18 B.2 GIBBON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 B.3 Determinantal Point Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.4 Local penalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.5 RAHBO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

C Implementation details 22 C.1 Acquisition function optimization . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.2 Benchmark BO methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.3 Test problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C.4 Heteroskedastic noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

D Extended results 24 D.1 Results including additional baselines . . . . . . . . . . . . . . . . . . . . . . . 24 D.2 Results for batch sizes 5 and 10 . . . . . . . . . . . . . . . . . . . . . . . . . . 26 D.3 Control problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 D.4 Run time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 D.5 Results with random initialization in round 0 . . . . . . . . . . . . . . . . . . . 29 D.6 BO curves for all experiments in Table 2 and Table A1 . . . . . . . . . . . . . . 30

A Approximating the expectation of the softmax weighted sum

A.1 Motivation

We are free to choose any energy function E in BEEBO, the only requirement being that we are able to compute an expectation of E in order to obtain the scalar summary E = E[ E(f)]. Of particular interest is the softmax weighted sum,

i=1 softmax(βf)ifi, (A1)

where β is the softmax inverse temperature. The softmax weight vector ω computed as

ωi = exp(βfi) PQ j=1 exp(βfj) + exp(βyymax) (A2)

where exp(βyymax) is an optional reference threshold value, as in expected improvement, which we set to 0 if not used (βy is either simply β or a dynamically scaled value that ensures E(f) does

not become 0, see Equation A28). The parameter β allows us to interpolate between two extreme regimes,

i=1 fi forβ 0 (A3)

E(f) = max(f) forβ (A4)

In the first regime, all points of a batch equally contribute to the energy, whereas in the second regime only the single point "responsible" for the maximum is controlling the energy. Note that for numerical reasons, operating towards the β limit is impractical, as it will lead to zero gradients for all but one point, preventing optimization. We can set β = A 1/2, with A being the prior uncertainty scale of the GP kernel. Since A represents the expected energy fluctuations for points far from data, this weighting scheme will reflect a natural compromise between Equation A3 and Equation A4.

As opposed to the mean, in the general case, the expectation of the maximum of a Q-dimensional multivariate normal is not available in closed form. To our best knowledge, this is also the case for the softmax weighted sum. In the following, we derive a closed-form approximation of the expectation of the softmax of βf that can be used for gradient-based optimization.

A.2 Derivation

Consider the softmax denominator

j=1 exp(βfj) + exp(βyymax). (A5)

We will Taylor expand ln(d) to the second order, using

d exp(βfi) = βωi (A6)

fi fj = β2 1

d2 exp(β(fi + fj) + 1

d2 δij exp(βfi)) (A7)

= β2(ωiδij ωiωj), (A8)

where δij is the Kronecker delta. So

ln(d) ln(da) + βw T f + β2

2 f T W f, (A9)

where a is the Taylor expansion point, da is d evaluated at a, f = f a, w is the Q-dimensional vector ω evaluated at a, and W is the Q Q matrix W = diag(w) ww T . Inserted into Equation A2 we have

ω(f)i wi exp(β fi βw T f β2

2 f T W f) (A10)

With this approximation we can calculate the expectation value

E[ω(f)i fi] = wi p

det(C(x) 1) (2π)Q/2

Z exp(λi(f)) fidf (A11)

λi(f) = β fi βw T f T β2

2 f T W f 1

2(f µ)T C(x) 1 (f µ)

2(f ν(i))T C(x) 1 softmax (f ν(i)), (A13)

Where c(i), ν(i) and C(x)softmax are defined as follows:

C(x)softmax = C(x) 1 + β2W 1 (A14)

b(i) = e(i) w (A15)

ν(i) = C(x)softmax βb(i) + C(x) 1 µ + β2W a (A16)

2(ν(i))T C(x) 1 softmax ν(i) β(b(i))T a (A17)

2 a T W a 1

2µT C(x) 1 µ (A18)

and e(i) is the i th basis vector with components e(i) j = δij. We can avoid the explict use of the precision matrix by rewriting the updated covariance matrix as

C(x)softmax = C(x) 1 (I + β2C(x) W) 1 = U(x) C(x),

where we have defined U(x) = I + β2C(x) W 1. The updated mean vectors can conveniently be expressed as

ν(i) = C(x)softmax (βb(i) + C(x) 1 µ + β2W a) (A19)

ν(i) = βC(x)softmax e(i) + C(x)softmax βw + C(x) 1 µ + β2W a (A20)

ν(i) = βC(x)softmax e(i) + ν , (A21)

where ν is a constant vector for all (i). Similarly

c(i) = βai + 1

2β2 C(x)softmax

i,i + βν i + c (A22)

c = βw T a + 1

2ν T C(x) 1 softmax ν β2

2 a T W a 1

2µT C(x) 1 µ (A23)

with c again being a constant for all (i). The expectation of the softmax weighted summary is given by

i=1 wi exp(c(i)) ν(i) i (A24)

det(C(x) 1)

det(C(x) 1+β2W ) = p

det(U(x)). The most natural choice for the expansion point is

a = µ in which case ν(i) and c(i) reduces to

ν(i) = βC(x)softmax e(i) w + µ (A25)

2 e(i) w T C(x)softmax e(i) w (A26)

A.3 Practical considerations

Linear algebra For numerical reasons, we avoid computing C(x)softmax explicitly, and instead use the U(x) C(x) factorization to compute solutions with U(x) 1 = I + β2C(x) W.

C(x)softmax e(i) w = U(x) C(x) e(i) C(x) w (A27)

Following GPy Torch practices, we make use of the Linear Operator package to exploit the structure of U(x) 1 as an Added Diag Linear Operator when solving. For determinants, we find that Linear Operator s logdet implementation gives nondeterministic results, and we therefore perform a dense cast before computing K using default Pytorch.

While the factorization is numerically advantageous, it is still limited with regards to β. We find that at β > 5, numerical errors prevent a reliable calculation of the expectation. In practice, A 1/2 lies in a range that allows numerically accurate solutions.

Softmax When ymax grows much larger than the softmax input vector f - a situation that can arise easily when initializing with random points for gradient-based optimization - the softmax weights ω can become numerically zero for all "real" points, thus leading to E(x) = 0, and vanishing gradients. As we always wish to preserve a minimal energy contribution from the real points, we parametrize the inverse temperature applied to ymax, βy, using a hyperparameter α that denotes the minimal fraction of probability mass pertaining to real points. This parametrization resembles the Log EI version of the expected improvement acquisition function [58] to address the problem of vanishing EI-gradients.

Let N denote the softmax denominator excluding ymax, N = PQ j=1 exp(β fi). We define

exp(βy ymax) = min 1 α

α N, exp(β ymax) (A28)

We used α = 0.05 as a default in all our experiments.

A.4 Number of effective points

We can interpret the softmax as the number of effective points contributing to the energy of the batch. The entropy H of the softmax is given by

i=1 ωi ln(ωi), (A29)

and the number of effective points, Deff, is exp(H(ω)), so that

i=1 ωi ln(ωi)

Deff is bounded by 1 (approaching the maximum) and Q (approaching the mean). Note that if we include ymax in the softmax denominator, we add ωymax ln(ωymax) to H(ω), and the resulting number becomes bounded by 1 and Q + 1.

B Relationship to other acquisition strategies

In the following section, we will discuss how BEEBO is related to UCB, GIBBON, Determinantal Point Processes (DPP), the Local Penalization heuristic and RAHBO. We will base our analysis on mean BEEBO, as the softmax-mediated interdependency of points in max BEEBO prevents a simple interpretation of the objective in a single-point stepwise manner and does not allow for the same direct analogies to other strategies.

B.1 Relationship of BEEBO T and UCB κ hyperparameters

BEEBO bears some resemblance to the UCB acquisition function, which in the single particle mode, Q = 1, reads a UCB(x) = µ(x) + κ p

C(x), (A31) where the parameter κ controls the balance between exploitation and exploration and µ(x) and C(x) are respectively the mean and variance of the posterior distribution, P(f | x, D), as before. We note that a UCB does not account for the uncertainty of the measurement at x, and therefore remains risk-neutral under heteroskedastic noise [11]. To understand the relationship between BEEBO and UCB, we will therefore limit ourselves to the homoskedastic case and furthermore assume that measurement variances σ2 are much smaller than the typical prior variance of the GP surrogate, A, of f, e.g. A N 1Tr(K), so σ2 A and M 1 = (K + σ2) 1 K 1. In this limit, the variance of f(x) after measurement (indexed at i = n, say) reduces to σ2:

C(x) = (K 1 + σ 2I) 1

nn = K(K + σ2I) 1σ2

nn σ2 (A32)

and the information gain becomes

2 ln(C(x)) log(σ).

Consequently, the gradient of the two acquisition functions reads

a UCB(x) = µ(x) + κ

a BEEBO(x) = µ(x) + T 2 C(x) C(x).

The two gradients will be identical at points x where the posterior uncertainties satisfy p

C(x) = T κ. For comparison, we may desire equal gradients at iso-surfaces corresponding to a given fraction, ν, of the prior uncertainty scale

A, by setting T accordingly as T = ν

A κ. In our experiments, we use ν = 1

2 and configure BEEBO using a dimensionless T explore-exploit parameter, defined as T = T

A, and set T = 1

2 κ for a given benchmark experiment.

GIBBON [42] approximates the (intractable) General-purpose max-value Entropy Search acquisition function, which quantifies the mutual information MI(f true; y|D) of a batch of measurements y and the unknown optimum f true. It does so using a lower bound on the information gain and MC estimation of the expectation over f true. It can be written as

αGIBBON(x) = 1

2 ln det(R) 1 2|M|

i=1 ln 1 ρ2 i ϕ(γi(m)) ϕ(γi(m))

γi(m) + ϕ(γi(m))

αGIBBON(x) = 1

2 ln det(R) +

i=1 ˆαGIBBON(xi), (A33)

where R is the correlation matrix with entries Rij = C(x)ij

C(x)ii C(x)jj , M is a set of samples for

the max-value f true, and ρi is the correlation of yi and ftrue(xi). ϕ and ϕ are the standard normal cumulative distribution and probability density functions, and γi(m) = m µ

The definition of BEEBO introduced in Equation 7, with the scalar summarization function set to the expected mean, E(x) = 1

Q PQ i=1 µ(xi), gives

αBEEBO(x) = T 1

2 (ln det (C(x)) ln det (Caug(x))) +

i=1 µ(xi). (A34)

From the second formulation of GIBBON, it becomes obvious that although being distinct in their motivation and derivation, BEEBO and GIBBON implement acquisition functions with a similar structure. Taking an information theoretic and multi-fidelity BO standpoint, GIBBON refers to this trade-off as diversity against quality, whereas in BEEBO we follow the intuitions of UCB, and use exploration and exploitation.

Quality - Exploitation: GIBBON employs an MC estimate of the lower bound approximation of the information gain provided by each point, whereas BEEBO directly summarizes the optimality of all points in closed form, either as their mean or an approximated softmax weighted sum. Diversity - Exploration: In GIBBON, the diversity derived from the differential entropy H(f|D, x) is the entropy of the posterior correlation 1

2 ln det(R). In BEEBO, we employ the reduction of entropy, the information gain I(x). Under homoskedastic noise, I(x) ln det(C(x)). Since R(x) = diag(C(x)) 1/2 C(x) diag(C(x)) 1/2, we have that ln det(R) = ln det(C(x)) PQ i ln(C(x)ii). Therefore, maximizing the log determinant of R penalizes points that have high variance.

Therefore, while GIBBON presents an attractive approximation of max-value Entropy Search for batched acquisition, BEEBO is an alternative that avoids approximating a quality criterion using MC. Moreover, GIBBON s diversity criterion implicitly penalizes points that have high variance, whereas BEEBO s criterion maximizes the reduction of variance. We find that BEEBO is orders of magnitudes faster to compute than GIBBON (Figure A3).

In the context of large batches (Q >> 10), a modification of GIBBON exists that is further similar to BEEBO. Departing from the strict max-value entropy search derivation, a scaling factor Q 2 is introduced to counteract a growing dominance of the diversity term:

αscaled GIBBON(x) = 1 2Q2 ln det(R) +

i=1 ˆαMES(xi). (A35)

This scaling is motivated by the fact that R contains Q2 elements. However, we note that R is summarized by its log determinant, which scales linearly in Q: As the determinant is the product of the eigenvalues, the log determinant is the sum of the log-eigenvalues. The number of eigenvalues scales linearly with matrix size Q, and so does the log determinant.

B.3 Determinantal Point Processes

A Determinantal Point Process [43] specifies a probability over a set of points, or a "configuration of points" drawn from a ground set. Specifically, the probability of a set of Q points x is given by

P(x) det (Lx) , (A36)

where Lx is a Q Q symmetric matrix. Kulesza et al. [43] provide a decomposition of the general DPP kernel L that makes quality and diversity components explicit, so that

Lij = q(xi)q(xj)k(xi, xj), (A37)

with k being a Rd Rd R+ similarity kernel, and q being a unary Rd R scalar quality function. This framework is naturally amenable to batch BO, as we seek to select a collection of points that trade off quality (optimality) and diversity. Note that both k and q are distinct functions that need to be specified by the user, leading to the practical complication that they must be chosen very carefully so that their scales do not dominate each other, which limits the utility of this decomposition in practice [42].

In the following, we show how BEEBO is equivalent to a DPP, and derive the necessary k and q. Again, we consider BEEBO

αBEEBO(x) = E(x) + T I(x), (A38)

with the scalar summarization function set to E(x) = PQ i=1 f(xi). We will first focus on the information gain term I(x), which we can rearrange as

2 ln det(C(x)) 1

2 ln det(Caug(x)) = 1

2 ln det C(x) C 1 aug(x) . (A39)

Our similarity kernel k is therefore given by the entries of the matrix S = C(x) Caug(x) 1, so that k(xi, xj) = Sij. Note that due to the augmented covariance term, the implied k also depends on all other currently selected points in x, and Lx is not a submatrix of an all-sample L. Therefore, BEEBO does not implement a DPP under heteroskedastic noise. However, if we only consider homoskedastic noise, BEEBO s I(x) simplifies to the posterior entropy [63], and therefore S = C(x). As C(x) can be accessed as a submatrix of an all-sample C, this permits a DPP.

Given the choice of E(f), we can rewrite BEEBO as

αBEEBO(x) = ln det (S) T 1

2 T αBEEBO(x) = ln det (S) +

2 T αBEEBO(x) = ln det (S) + ln det (D)) with D = diag(exp( 2

2 T αBEEBO(x) = ln det D 1 2 S D 1 2 D

T µi) = exp( 1

αBEEBO(x) = ln det D 1 2 S D 1 2 T 1

αBEEBO(x) = ln det (L) T 1

where L is a matrix with entries Lij = Sij exp( 1

T µi) exp( 1

T µj). BEEBO therefore uses the DPP quality function q(xi) = exp( 1

T µi), and, like proven previously for GIBBON, a batch x with maximal αBEEBO corresponds to the MAP of a DPP.

B.4 Local penalization

Local penalization (LP) is a greedy batch selection strategy that given any arbitrary single-point acquisition function, ensures diversity by applying a penalization function ψ(x, x ) that downweights the acquisition value of candidate locations x based on their proximity to already selected points. The criterion for selecting xi is given by

xi = arg max α(x)

j=1 ψ(x, xj). (A41)

Note that in this formulation, the product includes all previously selected points, not just the current batch. The penalization function ψ may in principle be chosen freely. Gonzalez et al. [35] propose exploiting the fact that ftrue is Lipschitz continuous in order to bound the position of the unknown optimum and penalize accordingly. The Lipschitz constant L is inferred from the GP surrogate and used to parametrize ψ. In LP, acquisition function optimization proceeds iteratively. After an xi is chosen, the corresponding penalizing multiplier is added to the objective before optimizing for the next xi+1.

While BEEBO enables optimization to proceed in parallel, it is of course possible to also optimize BEEBO greedily (under homoskedastic noise, I is submodular). In this case, it implements an LP strategy where α(x) = µ(x). Rather than a product of individual Rd Rd R function evaluations, the penalizer implied by BEEBO is the information gain I(x) : Ri d R that we evaluate by concatenating a candidate point to the already acquired x at each iteration. Like in GIBBON, this constitutes an LP strategy that does not require estimation of any properties of ftrue beyond learning the GP surrogate.

Risk-averse Heteroskedastic Bayesian Optimization (RAHBO) [11] is a UCB-derived single-point acquisition function that avoids heteroskedastic risk, preferentially selecting points with low noise. While it is not applicable to batched acquisition directly, we here compare it to single-sample BEEBO to highlight different ways of addressing noise. Given a heteroskedastic surrogate model that learns an additional GP for the noise, the variance proxy, RAHBO reads

αRAHBO(x) = UCBf(x) α LCBvar(x) αRAHBO(x) = µf(x) + βf σf(x) α(µvar(x) βvar σvar(x)) , (A42)

where µf and σf are the posterior mean and variance of the surrogate model and βf is the standard UCB trade-off hyperparameter, yielding the standard upper confidence bound UCBf. α is the chosen risk tolerance, and LCB is the lower confidence bound of the variance GP with posterior mean µvar and variance σvar traded off using βvar.

At Q = 1, BEEBO can be expressed as

αBEEBO = µf(x) + T 1

2 ln(σf(x)) T 1

2 ln(σaug f (x))

αBEEBO = µf(x) + T 1

σf(x) σaug f (x)

where the variance proxy at x is considered via the augmented posterior variance σaug f .

While RAHBO penalizes risk on an absolute scale, subject to α, BEEBO optimizes for high uncertainty reduction, quantified as the log ratio of the variance before and after making measurements.

Moreover, RAHBO differentiates between known and unknown variance proxies, and uses the LCBvar term to discount the predicted variance according to its uncertainty. In its closed-form analytical expression, BEEBO does not permit for the uncertainty of the variance proxy to be taken into account, being more similar to the known variance RAHBO

αRAHBO(x) = µf(x) + βf σf(x) αµvar(x) (A44)

where µvar is a noise-free proxy. Either a sampling-based approach, or approximations to I(x) would need to be introduced to handle variance proxy uncertainty in BEEBO.

C Implementation details

C.1 Acquisition function optimization

BEEBO was implemented for full compatibility with the Bo Torch framework (version 0.9.4) [52] as an Analytic Acquistion Function. Standard Bo Torch utilities for initializing and training GPs, initializing q-batches and performing gradient descent optimization of the acquisition function are used. We trained GPy Torch (version 1.11) [57] GP models with Ke Ops [64] Matérn 5/2 kernels (following Bo Torch defaults with a separate length scale for each input dimension, and Gamma priors on the length and output scales). Log determinants for the information gain were computed using singular value decomposition for numerical stability.

GPy Torch provides a get_fantasy_model method that allows for the efficient augmentation of the training data of a GP with a set of points, as done in BEEBO. However, we observed that GPy Torch s implementation suffers from GPU memory leaks when used with automatic differentiation enabled. We therefore instantiate augmented models explicitly, not making use of the (more efficient) augmentation strategy.

All experiments were performed with double precision. Sobol QMCNormal Sampler was used for acquisition functions making use of the reparametrization trick. Experiments were run on individual Nvidia RTX 6000 and V100 GPUs. Five replicates for the benchmarking experiments required a total of approx. 5,000 RTX 6000 GPU hours, with the majority of the run time dedicated to the GIBBON baseline, rather than BEEBO itself (Figure A3, Table A9).

C.2 Benchmark BO methods

All methods were benchmarked in Bo Torch. For q-EI, we used Log EI [58]. For TS, 10.000 base Sobol samples were drawn and sampled with Max Posterior Sampling using the Cholesky decomposition

of the covariance matrix. GIBBON was optimized using sequential optimization following the Bo Torch tutorial. We additionally implemented a custom version of GIBBON that applies the Q 2 scaling factor to the diversity term, as proposed in GIBBON s supplementary material. We used 100,000 random discretized candidates for max-value sampling. In a few iterations, optimizing GIBBON seemed challenging, with Bo Torch reporting that no nonzero initialization candidate could be identified. KB was optimized using a custom greedy optimization loop with fantasized observations, using (single-sample) Log EI as the underlying acquisition function. Tu RBO-1 was optimized following its Bo Torch tutorial. None of the methods use a hyperparameter for controlling their explore-exploit trade-off. The results are therefore based on 10 iterations at defaults.

C.3 Test problems

Test functions All test functions were used in their Bo Torch implementations. As done in previous work, the embedded Hartmann function was created by appending all-0 dummy dimensions to the original six dimensions [53, 54, 50].

Control problems We consider two control problems from previous work: A 14-dimensional parameter tuning task for controlling robot arms pushing two objects to a target location [55], and a 60-dimensional trajectory planning task for a rover navigating through a maze of obstacles [56]. Instead of converting the problem objectives into rewards as in the original work, we operate on the actual minimization objectives directly (distance to target, navigation loss), and follow Bo Torch s approach of simply inverting the objective in order to yield maximization problems. Both problems were adapted from their available implementations in Wang et al. [56] to follow the Bo Torch test problem API.

C.4 Heteroskedastic noise

The (inverted) Branin function has three global optima f(x ) = 0.397887 at x 1 = (9.42478, 2.475), x 2 = ( π, 12.275) and x 3 = (π, 2.275). We define heteroskedastic noise so that the variance is maximal at x 2 and x 3. The noise decays exponentially with the distance from any of the two noised optima at a rate λ.

σ2(x) = σ2 max exp( λ min( x x 2 2, x x 3 2) (A45)

For our experiments, we set σ2 max = 100 and λ = 0.05. As the surrogate function, we use a Heteroskedastic Single Task GP provided in Bo Torch. This model learns two GPs simultaneously, one for the function f(x) and one for the (also unknown) variance function σ2(x). When querying the oracle with a batch of points, noised observations of f(x) are provided together with the true σ2 at each point. The homoskedastic control experiment uses a Single Task GP with inferred noise level. The homoskedastic noise is set to σ2 = 77.5, which is the average noise level of the heteroskedastic function over the whole domain.

Figure A1: The Branin function with added heteroskedastic noise following Equation A45. σ2 max = 100, λ = 0.05.

D Extended results

D.1 Results including additional baselines

Table A1: BO on noise-free synthetic test problems. The normalized highest observed value after 10 rounds of BO with q=100 is shown. Colors are normalized row-wise. The BEE-BO and q-UCB columns are equivalent to Table 2. Higher means better. Results are means over five replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB GIBBON Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - default scaled -

Ackley 2 0.993 0.005 0.985 0.031 0.975 0.035 0.982 0.023 0.980 0.035 0.988 0.013 0.973 0.023 0.967 0.022 0.988 0.011 0.987 0.012 1.000 0.000 0.981 0.014 0.878 0.100 0.951 0.027 0.951 0.027 Levy 2 1.000 0.000 0.999 0.001 0.999 0.001 1.000 0.000 1.000 0.000 0.998 0.002 1.000 0.000 1.000 0.000 0.998 0.002 1.000 0.000 0.999 0.002 1.000 0.000 0.988 0.008 0.993 0.010 0.993 0.010 Rastrigin 2 0.981 0.024 0.989 0.016 0.983 0.016 0.993 0.007 0.983 0.011 0.993 0.006 0.951 0.021 0.983 0.015 0.933 0.025 0.995 0.007 1.000 0.000 0.976 0.021 0.903 0.087 0.944 0.038 0.944 0.038 Rosenbrock 2 0.976 0.045 0.956 0.071 0.955 0.080 0.982 0.032 0.979 0.027 0.938 0.123 0.949 0.074 0.943 0.129 0.962 0.079 0.982 0.029 0.966 0.079 0.976 0.068 0.633 0.355 0.843 0.301 0.843 0.301 Styblinski-Tang 2 0.961 0.072 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.001 1.000 0.000 1.000 0.000 0.999 0.001 1.000 0.000 1.000 0.000 1.000 0.000 0.996 0.003 0.999 0.001 0.999 0.001 Shekel 4 0.540 0.242 0.915 0.082 0.698 0.296 0.300 0.079 0.378 0.230 0.411 0.222 0.244 0.116 0.330 0.212 0.264 0.023 0.515 0.253 0.155 0.033 0.371 0.220 0.187 0.057 0.282 0.076 0.282 0.076 Hartmann 6 1.000 0.000 1.000 0.000 0.986 0.013 0.894 0.073 0.976 0.045 0.974 0.042 0.918 0.058 0.950 0.052 0.889 0.062 0.993 0.014 0.810 0.056 0.993 0.013 0.844 0.073 0.887 0.074 0.887 0.074 Cosine 8 1.000 0.000 0.999 0.001 0.619 0.099 0.999 0.001 0.972 0.016 0.895 0.077 0.934 0.032 0.924 0.032 0.621 0.192 0.802 0.060 1.000 0.000 0.985 0.011 0.900 0.071 0.937 0.046 0.937 0.046 Ackley 10 0.915 0.033 0.908 0.042 0.822 0.034 0.819 0.024 0.736 0.053 0.546 0.073 0.800 0.048 0.772 0.051 0.513 0.140 0.802 0.049 1.000 0.000 0.746 0.045 0.410 0.062 0.548 0.100 0.548 0.100 Levy 10 0.989 0.003 0.966 0.022 0.966 0.032 0.966 0.023 0.953 0.015 0.914 0.045 0.904 0.041 0.904 0.041 0.560 0.238 0.931 0.023 0.958 0.008 0.978 0.012 0.889 0.049 0.881 0.075 0.881 0.075 Powell 10 0.987 0.010 0.970 0.015 0.861 0.122 0.951 0.046 0.949 0.040 0.909 0.085 0.920 0.056 0.916 0.047 0.283 0.236 0.920 0.052 0.883 0.057 0.971 0.009 0.755 0.204 0.834 0.118 0.834 0.118 Rastrigin 10 0.463 0.123 0.536 0.157 0.595 0.066 0.558 0.091 0.573 0.122 0.590 0.087 0.420 0.075 0.522 0.081 0.311 0.138 0.456 0.101 1.000 0.000 0.431 0.099 0.359 0.135 0.210 0.158 0.210 0.158 Rosenbrock 10 0.994 0.005 0.991 0.007 0.904 0.068 0.991 0.007 0.986 0.010 0.975 0.028 0.966 0.033 0.971 0.024 0.645 0.255 0.984 0.005 0.870 0.050 0.993 0.005 0.974 0.021 0.979 0.015 0.979 0.015 Styblinski-Tang 10 0.837 0.072 0.835 0.068 0.289 0.155 0.822 0.050 0.638 0.107 0.229 0.156 0.309 0.190 0.492 0.167 0.049 0.104 0.740 0.147 0.430 0.198 0.852 0.071 0.056 0.080 0.255 0.193 0.255 0.193 Ackley 20 0.827 0.045 0.851 0.032 0.777 0.067 0.818 0.023 0.781 0.035 0.404 0.100 0.741 0.027 0.753 0.030 0.474 0.148 0.731 0.048 1.000 0.000 0.755 0.050 0.252 0.106 0.299 0.110 0.299 0.110 Levy 20 0.949 0.028 0.943 0.017 0.889 0.073 0.945 0.025 0.904 0.041 0.907 0.056 0.926 0.031 0.900 0.045 0.819 0.075 0.934 0.025 0.977 0.004 0.955 0.024 0.879 0.097 0.746 0.157 0.746 0.157 Powell 20 0.955 0.028 0.965 0.017 0.872 0.085 0.939 0.020 0.913 0.077 0.915 0.061 0.948 0.031 0.913 0.058 0.845 0.104 0.936 0.040 0.966 0.016 0.967 0.020 0.912 0.061 0.876 0.094 0.876 0.094 Rastrigin 20 0.399 0.083 0.473 0.064 0.508 0.043 0.484 0.089 0.472 0.088 0.522 0.074 0.423 0.081 0.480 0.061 0.401 0.073 0.456 0.070 1.000 0.000 0.447 0.084 0.413 0.047 0.397 0.098 0.397 0.098 Rosenbrock 20 0.993 0.003 0.995 0.003 0.907 0.069 0.992 0.005 0.983 0.013 0.933 0.047 0.973 0.014 0.982 0.008 0.924 0.049 0.987 0.007 0.946 0.021 0.995 0.002 0.953 0.044 0.980 0.016 0.980 0.016 Styblinski-Tang 20 0.737 0.065 0.689 0.114 0.330 0.165 0.667 0.099 0.394 0.090 0.274 0.055 0.203 0.105 0.561 0.143 0.034 0.053 0.621 0.104 0.210 0.201 0.645 0.112 0.104 0.081 0.332 0.112 0.332 0.112 Ackley 50 0.235 0.275 0.342 0.276 0.823 0.045 0.623 0.264 0.594 0.218 0.465 0.103 0.638 0.041 0.759 0.015 0.730 0.021 0.745 0.042 1.000 0.000 0.546 0.140 0.273 0.067 0.179 0.121 0.179 0.121 Levy 50 0.940 0.082 0.965 0.020 0.943 0.018 0.971 0.019 0.958 0.016 0.879 0.025 0.948 0.016 0.951 0.035 0.941 0.015 0.952 0.012 0.987 0.001 0.955 0.016 0.880 0.023 0.901 0.095 0.901 0.095 Powell 50 0.954 0.029 0.982 0.008 0.950 0.023 0.975 0.008 0.969 0.010 0.938 0.024 0.970 0.004 0.961 0.015 0.980 0.007 0.965 0.014 0.986 0.003 0.957 0.030 0.955 0.014 0.939 0.027 0.939 0.027 Rastrigin 50 0.322 0.137 0.476 0.042 0.432 0.051 0.472 0.034 0.470 0.021 0.439 0.024 0.431 0.039 0.397 0.050 0.481 0.041 0.468 0.036 1.000 0.000 0.409 0.068 0.447 0.045 0.305 0.098 0.305 0.098 Rosenbrock 50 0.971 0.016 0.984 0.002 0.983 0.010 0.976 0.008 0.981 0.005 0.962 0.019 0.968 0.005 0.986 0.003 0.981 0.012 0.979 0.005 0.977 0.003 0.984 0.011 0.973 0.013 0.977 0.008 0.977 0.008 Styblinski-Tang 50 0.584 0.087 0.675 0.062 0.393 0.161 0.509 0.064 0.342 0.090 0.325 0.060 0.312 0.105 0.694 0.039 0.356 0.079 0.632 0.129 0.236 0.205 0.699 0.044 0.203 0.039 0.506 0.084 0.506 0.084 Ackley 100 0.277 0.343 0.190 0.344 0.863 0.028 0.417 0.361 0.540 0.281 0.682 0.093 0.708 0.015 0.645 0.102 0.844 0.016 0.742 0.074 0.007 0.004 0.299 0.120 0.244 0.231 0.348 0.246 0.348 0.246 Emb. Hartmann 6 100 0.951 0.090 0.987 0.008 0.928 0.076 0.957 0.079 0.936 0.068 0.907 0.057 0.896 0.076 0.913 0.064 0.916 0.122 0.870 0.153 0.446 0.294 0.912 0.118 0.633 0.163 0.636 0.169 0.636 0.169 Levy 100 0.837 0.155 0.961 0.024 0.944 0.016 0.966 0.013 0.950 0.017 0.940 0.019 0.950 0.008 0.934 0.035 0.964 0.009 0.952 0.021 0.183 0.290 0.903 0.053 0.763 0.283 0.913 0.182 0.913 0.182 Powell 100 0.810 0.036 0.952 0.066 0.983 0.008 0.985 0.002 0.980 0.006 0.979 0.006 0.982 0.005 0.981 0.009 0.984 0.005 0.971 0.015 0.397 0.332 0.964 0.022 0.691 0.115 0.685 0.123 0.685 0.123 Rastrigin 100 0.497 0.034 0.401 0.160 0.459 0.024 0.441 0.117 0.455 0.026 0.455 0.029 0.446 0.016 0.442 0.041 0.443 0.020 0.482 0.023 0.332 0.461 0.634 0.088 0.290 0.112 0.286 0.085 0.286 0.085 Rosenbrock 100 0.822 0.075 0.971 0.012 0.980 0.009 0.953 0.084 0.976 0.009 0.970 0.011 0.972 0.006 0.969 0.016 0.978 0.013 0.974 0.003 0.290 0.377 0.936 0.052 0.966 0.075 0.931 0.114 0.931 0.114 Styblinski-Tang 100 0.537 0.045 0.474 0.110 0.401 0.106 0.423 0.045 0.353 0.055 0.296 0.030 0.308 0.042 0.532 0.109 0.278 0.041 0.536 0.077 0.205 0.248 0.627 0.058 0.222 0.053 0.221 0.054 0.221 0.054

Mean 0.795 0.828 0.788 0.811 0.790 0.744 0.758 0.801 0.678 0.819 0.734 0.813 0.631 0.667 0.727 Median 0.940 0.961 0.889 0.951 0.949 0.907 0.920 0.913 0.819 0.931 0.966 0.955 0.755 0.834 0.819

Table A2: BO on noise-free synthetic test problems. The relative batch instantaneous regret of the last, exploitative batch is shown. Colors are normalized row-wise. The BEEBO and q-UCB columns are equivalent to Table 3. Lower means better. Results are means over five replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB GIBBON Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - default scaled -

Ackley 2 0.292 0.102 0.268 0.120 0.257 0.098 0.259 0.125 0.245 0.109 0.165 0.098 1.006 0.020 0.999 0.022 1.002 0.024 0.946 0.050 0.498 0.203 0.800 0.157 1.048 0.025 0.821 0.191 0.248 0.138 Levy 2 0.134 0.055 0.092 0.056 0.102 0.037 0.114 0.012 0.102 0.037 0.111 0.012 1.236 0.378 1.046 0.134 1.114 0.167 1.106 0.138 0.280 0.288 0.184 0.194 1.637 0.568 0.678 0.477 0.091 0.160 Rastrigin 2 0.455 0.053 0.425 0.147 0.407 0.335 0.578 0.230 0.454 0.150 0.500 0.055 1.010 0.069 0.999 0.042 1.020 0.060 0.839 0.115 0.692 0.180 0.763 0.169 1.106 0.130 0.600 0.223 0.062 0.064 Rosenbrock 2 0.001 0.001 0.001 0.001 0.002 0.001 0.004 0.002 0.004 0.002 0.002 0.001 0.992 0.325 1.094 0.337 1.014 0.194 1.215 0.314 0.004 0.005 0.008 0.009 1.875 1.701 0.045 0.052 0.000 0.000 Styblinski-Tang 2 0.168 0.012 0.169 0.008 0.170 0.008 0.172 0.006 0.170 0.009 0.170 0.008 1.024 0.095 1.027 0.062 1.051 0.095 0.851 0.155 0.039 0.002 0.285 0.413 1.674 1.055 1.240 0.719 0.181 0.064 Shekel 4 0.810 0.094 0.688 0.090 0.688 0.090 0.776 0.073 0.730 0.057 0.695 0.082 0.993 0.009 0.995 0.004 0.988 0.006 0.964 0.032 0.936 0.030 1.000 0.016 1.003 0.003 0.985 0.028 0.587 0.125 Hartmann 6 0.060 0.022 0.078 0.030 0.100 0.015 0.229 0.111 0.086 0.072 0.098 0.028 0.968 0.057 0.971 0.015 0.862 0.054 0.867 0.043 0.358 0.007 0.352 0.194 1.029 0.011 1.023 0.016 0.049 0.023 Cosine 8 0.045 0.135 0.001 0.001 0.222 0.063 0.001 0.000 0.016 0.008 0.061 0.039 0.953 0.066 0.975 0.050 0.922 0.059 1.112 0.158 0.446 0.026 1.069 0.283 1.690 0.142 0.678 0.290 0.087 0.047 Ackley 10 0.478 0.143 0.314 0.082 0.253 0.056 0.338 0.055 0.345 0.080 0.452 0.063 0.931 0.014 0.943 0.015 0.950 0.022 0.942 0.070 0.983 0.012 0.917 0.230 1.017 0.003 1.014 0.004 0.316 0.088 Levy 10 0.041 0.042 0.023 0.020 0.261 0.067 0.030 0.026 0.048 0.030 0.103 0.066 1.188 0.166 1.011 0.083 1.111 0.081 0.741 0.202 0.520 0.082 0.352 0.188 2.704 0.222 1.387 1.054 0.022 0.011 Powell 10 0.016 0.021 0.009 0.002 0.067 0.022 0.027 0.012 0.067 0.027 0.151 0.044 1.037 0.184 1.101 0.156 1.215 0.275 0.232 0.093 0.148 0.042 0.041 0.018 2.882 0.202 0.432 0.568 0.003 0.002 Rastrigin 10 0.629 0.091 0.523 0.132 0.567 0.130 0.563 0.156 0.541 0.104 0.402 0.150 0.920 0.027 0.907 0.048 0.905 0.036 0.931 0.069 0.809 0.100 0.962 0.100 1.191 0.022 0.782 0.060 0.344 0.107 Rosenbrock 10 0.002 0.000 0.004 0.002 0.074 0.009 0.013 0.008 0.015 0.007 0.052 0.021 0.906 0.109 0.770 0.098 0.918 0.075 0.078 0.041 0.098 0.016 0.005 0.001 2.462 0.153 0.251 0.161 0.004 0.006 Styblinski-Tang 10 0.196 0.027 0.223 0.029 0.559 0.056 0.220 0.040 0.337 0.022 0.496 0.063 1.174 0.136 1.126 0.077 1.219 0.055 1.247 0.344 0.809 0.103 0.689 0.144 2.339 0.229 1.118 0.393 0.177 0.075 Ackley 20 0.629 0.143 0.282 0.157 0.226 0.068 0.219 0.036 0.292 0.063 0.586 0.084 0.945 0.030 0.950 0.020 0.917 0.012 0.927 0.091 0.979 0.002 1.017 0.001 1.016 0.002 0.841 0.118 0.571 0.104 Levy 20 0.128 0.095 0.063 0.067 0.140 0.111 0.241 0.151 0.113 0.055 0.182 0.062 0.839 0.109 0.914 0.148 1.056 0.121 0.941 1.228 0.734 0.040 0.212 0.038 3.269 0.155 0.967 0.762 0.058 0.025 Powell 20 0.093 0.059 0.010 0.012 0.028 0.019 0.081 0.022 0.074 0.015 0.110 0.024 0.809 0.117 0.689 0.117 0.870 0.217 0.106 0.028 0.487 0.102 0.019 0.005 3.510 0.321 0.238 0.153 0.009 0.003 Rastrigin 20 0.686 0.068 0.610 0.053 0.541 0.094 0.600 0.044 0.635 0.058 0.555 0.044 0.864 0.034 0.838 0.042 0.852 0.035 0.784 0.116 0.861 0.020 0.858 0.330 1.246 0.072 0.651 0.111 0.487 0.109 Rosenbrock 20 0.047 0.031 0.004 0.002 0.036 0.026 0.105 0.057 0.048 0.038 0.051 0.013 0.591 0.171 0.578 0.147 0.903 0.130 0.060 0.018 0.387 0.096 0.006 0.002 2.681 0.116 0.326 0.233 0.005 0.004 Styblinski-Tang 20 0.426 0.187 0.378 0.128 0.691 0.165 0.398 0.074 0.504 0.036 0.578 0.074 1.113 0.099 1.107 0.121 1.177 0.094 0.924 0.100 0.887 0.035 0.765 0.127 2.765 0.114 1.070 0.459 0.287 0.108 Ackley 50 0.895 0.042 0.738 0.246 0.177 0.044 0.464 0.270 0.606 0.233 0.530 0.102 0.949 0.033 0.947 0.037 0.874 0.025 0.842 0.045 0.986 0.001 0.567 0.255 1.015 0.001 0.835 0.128 0.820 0.051 Levy 50 0.055 0.047 0.033 0.029 0.051 0.044 0.029 0.016 0.085 0.116 0.268 0.071 0.611 0.105 0.681 0.092 0.892 0.277 0.113 0.070 0.881 0.014 0.093 0.030 1.807 0.611 0.196 0.125 0.232 0.043 Powell 50 0.018 0.009 0.014 0.005 0.018 0.008 0.021 0.020 0.078 0.066 0.064 0.029 0.542 0.149 0.499 0.137 0.785 0.166 0.052 0.033 0.868 0.034 0.021 0.010 0.646 0.640 0.162 0.362 0.053 0.028 Rastrigin 50 0.793 0.184 0.653 0.247 0.795 0.374 0.573 0.061 0.592 0.048 0.585 0.075 0.813 0.030 0.810 0.044 0.768 0.019 0.662 0.050 0.934 0.012 0.860 0.353 1.234 0.220 0.716 0.121 0.560 0.037 Rosenbrock 50 0.016 0.009 0.049 0.123 0.010 0.007 0.021 0.009 0.031 0.019 0.048 0.021 0.539 0.175 0.520 0.134 0.594 0.142 0.033 0.010 0.801 0.037 0.014 0.005 1.337 0.430 0.031 0.022 0.055 0.027 Styblinski-Tang 50 0.463 0.206 0.676 1.257 0.681 0.142 0.478 0.099 0.574 0.079 0.727 0.031 1.012 0.065 1.196 0.136 0.981 0.049 0.685 0.238 0.961 0.018 0.620 0.358 1.696 0.490 0.617 0.178 0.454 0.055 Ackley 100 0.718 0.340 0.900 0.256 0.137 0.027 0.636 0.313 0.466 0.275 0.321 0.092 0.948 0.031 0.935 0.036 0.863 0.038 0.805 0.087 0.997 0.001 0.731 0.156 1.005 0.013 0.940 0.080 0.902 0.009 Emb. Hartmann 6 100 0.068 0.052 0.035 0.031 0.175 0.119 0.144 0.134 0.086 0.089 0.172 0.117 0.573 0.042 0.863 0.041 0.692 0.131 0.423 0.307 0.869 0.023 0.110 0.107 0.864 0.021 0.872 0.026 0.098 0.094 Levy 100 0.119 0.103 0.044 0.032 0.042 0.013 0.031 0.013 0.164 0.103 0.056 0.025 0.615 0.085 0.716 0.107 0.586 0.107 0.139 0.096 0.975 0.021 0.094 0.050 1.129 0.956 0.116 0.208 0.303 0.028 Powell 100 0.094 0.017 0.027 0.022 0.013 0.009 0.011 0.003 0.041 0.054 0.018 0.016 0.465 0.070 0.493 0.092 0.524 0.130 0.086 0.161 1.008 0.049 0.027 0.010 0.431 0.080 0.420 0.183 0.104 0.013 Rastrigin 100 0.506 0.051 0.604 0.142 0.540 0.035 0.501 0.092 0.557 0.053 0.544 0.047 0.759 0.019 0.832 0.034 0.780 0.025 0.658 0.050 0.986 0.012 0.624 0.087 0.918 0.214 0.713 0.144 0.584 0.018 Rosenbrock 100 0.114 0.043 0.027 0.011 0.014 0.009 0.044 0.051 0.048 0.039 0.031 0.015 0.518 0.100 0.589 0.127 0.507 0.092 0.091 0.072 0.962 0.045 0.046 0.026 0.758 0.520 0.127 0.205 0.133 0.019 Styblinski-Tang 100 0.389 0.030 0.503 0.222 0.562 0.142 0.522 0.095 0.582 0.132 0.742 0.082 0.924 0.042 1.203 0.188 0.930 0.049 0.584 0.208 0.979 0.013 0.333 0.059 0.752 0.052 0.855 0.179 0.538 0.027

Mean 0.29 0.26 0.26 0.26 0.26 0.29 0.87 0.89 0.90 0.64 0.70 0.44 1.57 0.66 0.26 Median 0.13 0.09 0.17 0.22 0.16 0.17 0.93 0.94 0.92 0.78 0.86 0.35 1.23 0.71 0.18

Table A3: Paired t-test p-values for the results of mean BEEBO in Table 2. The combined p-value was computed using Fisher s method. P-values smaller than 0.05 are indicated in bold.

mean BEEBO T =0.05 mean BEEBO T =0.5 mean BEEBO T =5.0

Problem d q-UCB q-EI TS KB GIBBON GIBBON (s) Tu RBO q-UCB q-EI TS KB GIBBON GIBBON (s) Tu RBO q-UCB q-EI TS KB GIBBON GIBBON (s) Tu RBO

Ackley 2 8E-03 7E-02 1E+00 9E-03 3E-03 8E-04 1E-01 1E-01 6E-01 9E-01 3E-01 8E-03 1E-02 4E-01 9E-01 8E-01 1E+00 7E-01 8E-03 7E-02 6E-01 Levy 2 4E-01 5E-01 2E-01 1E+00 6E-04 4E-02 7E-02 7E-01 1E+00 3E-01 1E+00 5E-04 5E-02 7E-02 2E-01 1E+00 7E-01 1E+00 9E-04 7E-02 7E-02 Rastrigin 2 7E-03 9E-01 1E+00 3E-01 1E-02 1E-02 9E-01 2E-01 8E-01 1E+00 4E-02 7E-03 2E-03 5E-01 5E-04 1E+00 1E+00 2E-01 4E-03 9E-03 8E-01 Rosenbrock 2 1E-01 8E-01 2E-01 5E-01 6E-03 1E-01 3E-02 3E-01 9E-01 8E-01 9E-01 6E-03 1E-01 1E-01 8E-01 9E-01 9E-01 1E+00 9E-03 1E-01 8E-02 Styblinski-Tang 2 9E-01 9E-01 9E-01 9E-01 9E-01 9E-01 2E-05 3E-01 5E-01 1E-01 9E-01 2E-03 5E-02 5E-06 1E-01 6E-01 6E-02 1E+00 2E-03 1E-02 5E-06 Shekel 4 5E-03 4E-01 4E-04 4E-02 4E-04 6E-03 9E-01 4E-06 4E-04 8E-10 2E-05 1E-08 7E-09 5E-03 6E-04 9E-02 1E-04 1E-02 2E-04 6E-04 4E-01 Hartmann 6 8E-04 7E-02 1E-06 7E-02 4E-05 5E-04 1E-04 7E-03 7E-02 1E-06 7E-02 4E-05 5E-04 1E-04 3E-04 9E-01 2E-06 9E-01 1E-04 1E-03 3E-04 Cosine 8 6E-05 1E-06 1E+00 1E-03 8E-04 1E-03 6E-05 2E-05 1E-06 1E+00 2E-03 9E-04 1E-03 8E-05 5E-01 1E+00 1E+00 1E+00 1E+00 1E+00 1E+00 Ackley 10 4E-05 3E-05 1E+00 3E-06 5E-09 1E-06 8E-05 6E-05 2E-04 1E+00 5E-06 5E-10 8E-07 1E-04 8E-05 1E-01 1E+00 8E-04 4E-09 1E-05 1E-02 Levy 10 5E-05 1E-05 1E-07 6E-03 5E-05 7E-04 3E-03 1E-03 3E-03 1E-01 9E-01 8E-04 5E-03 7E-01 2E-04 1E-03 2E-01 8E-01 1E-04 5E-04 7E-01 Powell 10 3E-03 1E-03 1E-04 3E-04 3E-03 1E-03 1E-01 2E-03 9E-03 6E-04 6E-01 4E-03 2E-03 9E-01 1E-05 9E-01 7E-01 1E+00 4E-02 3E-01 1E+00 Rastrigin 10 2E-01 4E-01 1E+00 2E-01 6E-02 6E-05 1E+00 4E-01 6E-02 1E+00 5E-03 5E-03 4E-05 1E+00 1E-04 7E-05 1E+00 3E-04 8E-06 4E-05 1E+00 Rosenbrock 10 5E-03 2E-03 6E-06 1E-01 1E-03 8E-04 3E-01 5E-03 3E-03 5E-06 1E+00 4E-03 4E-03 8E-01 4E-03 1E+00 1E-01 1E+00 1E+00 1E+00 1E+00 Styblinski-Tang 10 6E-06 1E-02 3E-04 8E-01 2E-09 2E-06 3E-03 1E-05 1E-02 2E-04 8E-01 3E-09 3E-06 3E-03 8E-04 1E+00 1E+00 1E+00 2E-04 3E-01 1E+00 Ackley 20 8E-04 7E-04 1E+00 9E-03 4E-08 1E-07 3E-06 7E-05 1E-04 1E+00 1E-03 6E-09 2E-07 1E-06 9E-06 4E-02 1E+00 2E-01 2E-08 1E-06 8E-06 Levy 20 6E-02 2E-01 1E+00 8E-01 2E-02 1E-03 1E-01 1E-02 2E-01 1E+00 1E+00 3E-02 1E-03 2E-01 2E-02 1E+00 1E+00 1E+00 4E-01 1E-02 9E-01 Powell 20 3E-01 1E-02 1E+00 1E+00 2E-02 7E-03 1E+00 2E-03 1E-02 6E-01 6E-01 9E-03 5E-03 8E-01 2E-01 1E+00 1E+00 1E+00 1E+00 6E-01 1E+00 Rastrigin 20 8E-01 1E+00 1E+00 9E-01 7E-01 5E-01 1E+00 6E-01 2E-01 1E+00 1E-01 2E-02 2E-02 9E-01 1E-04 1E-02 1E+00 2E-02 2E-03 2E-03 6E-01 Rosenbrock 20 6E-04 1E-02 2E-05 1E+00 9E-03 1E-02 3E-01 4E-05 3E-04 1E-05 8E-01 8E-03 6E-03 1E-01 8E-01 1E+00 1E+00 1E+00 1E+00 1E+00 1E+00 Styblinski-Tang 20 1E-08 5E-04 2E-05 2E-03 3E-10 8E-07 1E-04 2E-02 6E-02 6E-05 1E-01 3E-08 2E-05 3E-02 5E-05 1E+00 1E-01 1E+00 5E-04 5E-01 1E+00 Ackley 50 1E+00 1E+00 1E+00 1E+00 6E-01 3E-01 4E-01 1E+00 1E+00 1E+00 1E+00 3E-01 6E-02 8E-02 2E-04 4E-03 1E+00 2E-04 1E-09 2E-07 6E-10 Levy 50 6E-01 7E-01 1E+00 7E-01 3E-02 8E-02 9E-05 2E-01 4E-02 1E+00 8E-02 8E-06 2E-02 3E-08 4E-01 9E-01 1E+00 9E-01 6E-06 1E-01 2E-06 Powell 50 9E-01 8E-01 1E+00 6E-01 5E-01 1E-01 1E-02 2E-04 3E-04 1E+00 8E-03 2E-06 9E-05 1E-06 1E+00 1E+00 1E+00 7E-01 8E-01 9E-02 1E-02 Rastrigin 50 1E+00 1E+00 1E+00 1E+00 1E+00 4E-01 1E+00 5E-04 3E-01 1E+00 3E-02 5E-02 1E-04 1E-01 1E+00 1E+00 1E+00 2E-01 8E-01 4E-03 8E-01 Rosenbrock 50 3E-01 9E-01 9E-01 1E+00 6E-01 8E-01 2E-02 9E-01 8E-03 3E-05 4E-01 1E-02 2E-02 4E-04 3E-01 2E-01 4E-02 6E-01 2E-02 1E-01 3E-04 Styblinski-Tang 50 1E-04 9E-01 2E-04 1E+00 3E-07 3E-02 3E-03 9E-01 1E-01 4E-05 8E-01 9E-10 3E-04 1E-05 2E-01 1E+00 8E-02 1E+00 2E-03 9E-01 9E-01 Ackley 100 1E+00 1E+00 2E-02 6E-01 5E-01 7E-01 7E-02 1E+00 1E+00 6E-02 8E-01 6E-01 7E-01 2E-01 4E-02 5E-04 3E-15 3E-07 2E-05 2E-04 2E-15 Emb. Hartmann 6 100 7E-02 1E-02 4E-04 1E-01 2E-05 3E-05 3E-01 2E-03 2E-02 1E-04 4E-02 4E-05 5E-05 3E-02 4E-01 2E-01 8E-05 4E-01 6E-05 8E-05 7E-01 Levy 100 1E+00 1E+00 2E-05 9E-01 3E-01 9E-01 6E-03 2E-02 2E-01 6E-06 1E-03 3E-02 2E-01 3E-09 1E+00 8E-01 9E-06 3E-02 4E-02 3E-01 5E-10 Powell 100 1E+00 1E+00 1E-03 1E+00 3E-03 5E-03 1E+00 9E-01 8E-01 2E-04 8E-01 4E-07 7E-05 1E-03 7E-01 3E-02 2E-04 2E-02 1E-05 2E-05 2E-08 Rastrigin 100 2E-03 1E-01 1E-01 1E+00 5E-04 4E-06 6E-05 8E-01 9E-01 3E-01 1E+00 2E-03 2E-02 4E-01 7E-02 1E+00 2E-01 1E+00 3E-04 9E-05 6E-05 Rosenbrock 100 1E+00 1E+00 3E-04 1E+00 1E+00 1E+00 9E-01 2E-01 7E-01 2E-04 2E-02 4E-01 2E-01 6E-08 4E-01 5E-02 1E-04 2E-02 3E-01 1E-01 5E-08 Styblinski-Tang 100 3E-06 5E-01 8E-04 1E+00 4E-08 4E-08 3E-06 9E-01 9E-01 9E-03 1E+00 3E-05 3E-05 3E-02 3E-03 1E+00 2E-02 1E+00 4E-04 1E-03 4E-01

Combined 4E-16 2E-02 8E-02 1E+00 6E-47 1E-40 2E-18 1E-20 2E-12 7E-06 1E-03 2E-63 3E-51 2E-36 2E-21 1E+00 1E+00 1E+00 4E-44 4E-27 2E-23

Table A4: Paired t-test p-values for the results of max BEEBO in Table 2. The combined p-value was computed using Fisher s method. P-values smaller than 0.05 are indicated in bold.

max BEEBO T =0.05 max BEEBO T =0.5 max BEEBO T =5.0

Problem d q-UCB q-EI TS KB GIBBON GIBBON (s) Tu RBO q-UCB q-EI TS KB GIBBON GIBBON (s) Tu RBO q-UCB q-EI TS KB GIBBON GIBBON (s) Tu RBO

Ackley 2 2E-01 7E-01 1E+00 5E-01 3E-03 2E-03 4E-01 2E-01 7E-01 9E-01 5E-01 1E-02 4E-02 5E-01 5E-01 5E-01 1E+00 1E-01 4E-03 5E-03 3E-01 Levy 2 8E-01 9E-01 3E-01 1E+00 7E-04 5E-02 7E-02 3E-01 9E-01 2E-01 1E+00 5E-04 4E-02 7E-02 5E-01 1E+00 9E-01 1E+00 3E-03 9E-02 7E-02 Rastrigin 2 2E-04 7E-01 1E+00 8E-03 5E-03 2E-03 3E-01 5E-01 1E+00 1E+00 2E-01 1E-02 4E-03 9E-01 3E-05 7E-01 1E+00 2E-02 6E-03 1E-03 2E-01 Rosenbrock 2 8E-02 5E-01 2E-01 3E-01 6E-03 9E-02 3E-02 2E-01 6E-01 3E-01 4E-01 5E-03 9E-02 4E-02 9E-01 9E-01 1E+00 1E+00 1E-02 2E-01 2E-01 Styblinski-Tang 2 4E-01 2E-01 7E-04 1E+00 2E-03 1E-02 5E-06 5E-01 8E-01 2E-01 1E+00 2E-03 3E-02 5E-06 4E-01 8E-01 6E-01 9E-01 4E-03 1E-01 5E-06 Shekel 4 1E-01 1E+00 2E-04 8E-01 5E-04 3E-01 1E+00 3E-01 9E-01 5E-03 5E-01 2E-02 1E-01 1E+00 3E-02 9E-01 4E-03 3E-01 7E-03 4E-02 1E+00 Hartmann 6 9E-01 1E+00 2E-05 1E+00 4E-03 3E-01 1E+00 1E-01 8E-01 1E-05 8E-01 2E-04 2E-03 2E-02 9E-04 9E-01 7E-06 9E-01 2E-04 3E-03 1E-02 Cosine 8 6E-05 1E-06 1E+00 1E-03 8E-04 1E-03 7E-05 1E-03 2E-05 1E+00 1E+00 4E-03 2E-02 8E-04 2E-03 1E-02 1E+00 1E+00 6E-01 9E-01 9E-01 Ackley 10 2E-01 2E-01 1E+00 2E-04 2E-08 1E-05 2E-02 9E-01 1E+00 1E+00 7E-01 6E-08 6E-05 9E-01 2E-01 1E+00 1E+00 1E+00 1E-04 5E-01 1E+00 Levy 10 5E-04 3E-03 9E-02 9E-01 6E-04 5E-03 8E-01 3E-03 4E-03 9E-01 1E+00 7E-04 5E-03 1E+00 1E-03 9E-01 1E+00 1E+00 1E-01 1E-01 1E+00 Powell 10 8E-02 4E-02 1E-03 9E-01 2E-03 1E-03 1E+00 4E-02 4E-02 2E-03 9E-01 3E-03 2E-03 1E+00 9E-06 8E-01 7E-02 1E+00 1E-02 5E-02 1E+00 Rastrigin 10 3E-04 1E-02 1E+00 3E-03 6E-04 9E-05 1E+00 2E-01 2E-03 1E+00 1E-04 1E-03 1E-04 1E+00 2E-04 5E-03 1E+00 4E-03 1E-04 7E-05 9E-01 Rosenbrock 10 1E-02 4E-03 6E-06 1E+00 6E-03 6E-03 8E-01 1E-02 3E-01 4E-06 1E+00 9E-03 2E-02 1E+00 1E-03 8E-01 4E-06 1E+00 4E-01 7E-01 1E+00 Styblinski-Tang 10 7E-06 5E-02 2E-04 1E+00 5E-10 4E-06 2E-03 7E-03 9E-01 8E-03 1E+00 2E-07 8E-05 6E-01 1E-02 1E+00 1E+00 1E+00 9E-03 6E-01 1E+00 Ackley 20 1E-04 6E-04 1E+00 9E-03 1E-08 3E-07 2E-06 1E-02 2E-02 1E+00 9E-02 4E-08 7E-07 2E-06 9E-01 1E+00 1E+00 1E+00 6E-04 2E-02 1E+00 Levy 20 1E-01 1E-01 1E+00 8E-01 6E-02 9E-04 1E-01 3E-01 1E+00 1E+00 1E+00 3E-01 2E-03 8E-01 2E-02 9E-01 1E+00 1E+00 2E-01 3E-03 7E-01 Powell 20 9E-01 4E-01 1E+00 1E+00 6E-02 2E-02 1E+00 5E-01 9E-01 1E+00 1E+00 5E-01 2E-02 1E+00 3E-03 9E-01 1E+00 1E+00 4E-01 1E-02 1E+00 Rastrigin 20 5E-03 7E-02 1E+00 3E-02 1E-02 3E-03 9E-01 6E-01 1E-01 1E+00 1E-01 4E-02 1E-02 9E-01 9E-05 1E-02 1E+00 2E-02 1E-03 5E-03 4E-01 Rosenbrock 20 5E-04 1E-02 1E-05 1E+00 8E-03 9E-03 4E-01 4E-01 9E-01 4E-06 1E+00 2E-02 2E-01 1E+00 3E-01 1E+00 9E-01 1E+00 9E-01 1E+00 1E+00 Styblinski-Tang 20 5E-08 1E-01 9E-05 3E-01 2E-09 2E-06 1E-01 1E+00 1E+00 1E-02 1E+00 4E-06 7E-02 1E+00 4E-07 1E+00 1E-01 1E+00 1E-05 9E-01 1E+00 Ackley 50 6E-01 9E-01 1E+00 2E-01 2E-02 9E-04 8E-04 1E+00 1E+00 1E+00 3E-01 7E-04 3E-03 4E-04 1E+00 1E+00 1E+00 9E-01 5E-04 3E-04 5E-05 Levy 50 2E-02 2E-02 1E+00 8E-03 3E-06 1E-02 1E-07 3E-01 1E-01 1E+00 4E-01 8E-07 4E-02 4E-07 1E+00 1E+00 1E+00 1E+00 5E-01 7E-01 5E-05 Powell 50 4E-02 3E-02 1E+00 5E-02 2E-04 7E-04 8E-06 8E-02 2E-01 1E+00 9E-02 6E-03 1E-03 5E-06 1E+00 1E+00 1E+00 1E+00 1E+00 5E-01 4E-03 Rastrigin 50 1E-02 4E-01 1E+00 2E-02 4E-02 6E-04 2E-01 2E-04 5E-01 1E+00 1E-02 8E-02 3E-04 2E-01 1E+00 1E+00 1E+00 1E-01 7E-01 5E-04 7E-01 Rosenbrock 50 7E-03 9E-01 7E-01 1E+00 2E-01 7E-01 5E-03 1E+00 2E-01 2E-02 8E-01 4E-02 2E-01 7E-04 1E+00 1E+00 1E+00 1E+00 1E+00 1E+00 3E-01 Styblinski-Tang 50 5E-05 1E+00 2E-03 1E+00 3E-07 5E-01 2E-01 1E+00 1E+00 8E-02 1E+00 4E-05 1E+00 1E+00 8E-01 1E+00 1E-01 1E+00 6E-05 1E+00 1E+00 Ackley 100 1E+00 1E+00 3E-03 2E-01 6E-02 1E-01 1E-02 9E-01 1E+00 9E-05 2E-02 1E-02 1E-01 4E-04 1E+00 9E-01 1E-09 8E-07 2E-04 2E-03 1E-08 Emb. Hartmann 6 100 3E-02 4E-02 7E-05 9E-02 2E-05 3E-05 2E-01 3E-01 1E-01 2E-04 2E-01 2E-04 2E-04 5E-01 6E-01 2E-01 3E-04 6E-01 3E-04 4E-04 9E-01 Levy 100 5E-03 7E-02 6E-06 1E-03 2E-02 2E-01 5E-10 1E-01 6E-01 7E-06 1E-02 4E-02 3E-01 1E-09 1E+00 9E-01 7E-06 2E-02 4E-02 3E-01 2E-09 Powell 100 6E-02 6E-03 2E-04 9E-03 1E-05 2E-05 7E-09 7E-01 7E-02 2E-04 2E-02 1E-05 2E-05 9E-09 1E+00 1E-01 2E-04 3E-02 1E-05 2E-05 5E-08 Rastrigin 100 6E-01 8E-01 2E-01 1E+00 1E-02 4E-03 6E-02 2E-01 1E+00 2E-01 1E+00 1E-03 9E-05 3E-04 2E-01 1E+00 2E-01 1E+00 2E-03 1E-04 2E-04 Rosenbrock 100 8E-01 8E-01 1E-04 2E-01 6E-01 4E-01 4E-03 1E-01 2E-01 1E-04 1E-02 3E-01 1E-01 3E-08 9E-01 9E-01 1E-04 4E-02 4E-01 2E-01 2E-07 Styblinski-Tang 100 3E-04 1E+00 6E-03 1E+00 2E-05 5E-06 1E-01 1E+00 1E+00 4E-02 1E+00 5E-04 5E-04 9E-01 1E-01 1E+00 1E-01 1E+00 4E-03 2E-03 1E+00

Combined 1E-27 3E-04 1E+00 1E+00 6E-69 1E-44 1E-24 9E-02 1E+00 1E+00 1E+00 3E-51 9E-36 3E-16 1E-04 1E+00 1E+00 1E+00 8E-31 6E-16 3E-03

D.2 Results for batch sizes 5 and 10

Table A5: BO on noise-free synthetic test problems. The normalized highest observed value after 10 rounds of BO with q=5 is shown. Colors are normalized row-wise. Higher means better. Results are means over ten replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB GIBBON Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - default -

Ackley 2 0.116 0.249 0.240 0.358 0.903 0.052 0.865 0.073 0.865 0.044 0.849 0.049 0.793 0.096 0.919 0.062 0.785 0.084 0.885 0.042 0.555 0.312 0.865 0.063 0.799 0.091 0.550 0.409 Levy 2 0.800 0.198 0.880 0.186 0.995 0.005 0.994 0.008 0.998 0.003 0.995 0.009 0.997 0.004 0.996 0.008 0.992 0.006 0.997 0.004 0.912 0.128 0.998 0.006 0.975 0.018 0.821 0.208 Rastrigin 2 0.513 0.210 0.493 0.289 0.920 0.038 0.875 0.075 0.924 0.044 0.851 0.095 0.842 0.088 0.950 0.029 0.839 0.136 0.871 0.129 0.701 0.163 0.875 0.087 0.808 0.096 0.509 0.275 Rosenbrock 2 0.955 0.134 0.973 0.056 0.976 0.029 0.961 0.097 0.984 0.019 0.829 0.331 0.926 0.224 0.894 0.314 0.892 0.314 0.912 0.228 0.763 0.324 0.889 0.313 0.872 0.308 0.884 0.312 Styblinski-Tang 2 0.594 0.295 0.705 0.316 0.897 0.106 0.669 0.284 0.850 0.161 0.893 0.167 0.837 0.227 0.898 0.168 0.913 0.138 0.948 0.068 0.692 0.298 0.909 0.131 0.866 0.108 0.422 0.313 Shekel 4 0.198 0.046 0.236 0.087 0.194 0.071 0.167 0.052 0.177 0.079 0.112 0.065 0.131 0.065 0.260 0.069 0.103 0.041 0.196 0.072 0.131 0.038 0.160 0.053 0.142 0.085 0.175 0.075 Hartmann 6 0.950 0.040 0.948 0.044 0.880 0.097 0.935 0.046 0.862 0.204 0.681 0.255 0.795 0.118 0.954 0.037 0.486 0.207 0.944 0.040 0.792 0.106 0.950 0.039 0.905 0.075 0.916 0.049 Cosine 8 0.553 0.101 0.701 0.093 0.847 0.058 0.611 0.105 0.702 0.079 0.663 0.051 0.698 0.119 0.824 0.054 0.793 0.089 0.759 0.064 0.722 0.322 0.741 0.085 0.685 0.084 0.624 0.110 Ackley 10 0.053 0.068 0.272 0.295 0.537 0.162 0.143 0.146 0.425 0.228 0.290 0.157 0.191 0.104 0.620 0.061 0.228 0.117 0.597 0.068 0.348 0.451 0.532 0.127 0.530 0.081 0.310 0.129 Levy 10 0.595 0.222 0.570 0.291 0.910 0.093 0.695 0.086 0.802 0.115 0.839 0.081 0.848 0.127 0.847 0.111 0.855 0.119 0.826 0.155 0.558 0.289 0.817 0.097 0.844 0.077 0.640 0.215 Powell 10 0.938 0.041 0.932 0.045 0.925 0.048 0.936 0.032 0.924 0.028 0.797 0.258 0.844 0.146 0.932 0.040 0.656 0.233 0.940 0.029 0.600 0.323 0.933 0.035 0.795 0.153 0.898 0.076 Rastrigin 10 0.377 0.118 0.331 0.133 0.375 0.101 0.276 0.097 0.292 0.095 0.357 0.094 0.400 0.085 0.413 0.168 0.269 0.142 0.406 0.065 0.231 0.152 0.426 0.032 0.400 0.087 0.365 0.156 Rosenbrock 10 0.967 0.030 0.984 0.018 0.987 0.009 0.974 0.020 0.971 0.015 0.964 0.022 0.965 0.057 0.984 0.014 0.974 0.014 0.976 0.017 0.776 0.224 0.973 0.021 0.944 0.043 0.962 0.041 Styblinski-Tang 10 0.631 0.135 0.662 0.106 0.270 0.170 0.623 0.136 0.552 0.131 0.166 0.176 0.394 0.117 0.594 0.111 0.139 0.153 0.632 0.133 0.261 0.191 0.602 0.192 0.258 0.222 0.400 0.105 Ackley 20 0.019 0.011 0.044 0.062 0.354 0.175 0.059 0.061 0.182 0.173 0.148 0.086 0.276 0.112 0.231 0.208 0.186 0.092 0.199 0.141 0.109 0.313 0.172 0.147 0.303 0.161 0.052 0.024 Levy 20 0.571 0.155 0.608 0.175 0.795 0.082 0.663 0.118 0.681 0.132 0.760 0.127 0.788 0.098 0.663 0.145 0.780 0.068 0.665 0.112 0.373 0.214 0.660 0.148 0.751 0.110 0.567 0.158 Powell 20 0.864 0.062 0.889 0.061 0.902 0.060 0.872 0.075 0.874 0.053 0.854 0.101 0.920 0.048 0.894 0.060 0.812 0.125 0.882 0.076 0.460 0.180 0.894 0.049 0.891 0.071 0.764 0.134 Rastrigin 20 0.237 0.121 0.256 0.103 0.272 0.127 0.217 0.120 0.239 0.119 0.266 0.143 0.311 0.124 0.223 0.111 0.257 0.091 0.225 0.111 0.109 0.083 0.227 0.125 0.308 0.047 0.167 0.095 Rosenbrock 20 0.923 0.012 0.958 0.011 0.964 0.021 0.924 0.026 0.942 0.024 0.932 0.034 0.928 0.023 0.958 0.012 0.942 0.026 0.941 0.036 0.368 0.257 0.960 0.018 0.932 0.030 0.891 0.037 Styblinski-Tang 20 0.576 0.080 0.529 0.125 0.401 0.085 0.529 0.078 0.487 0.079 0.271 0.092 0.275 0.092 0.558 0.083 0.180 0.110 0.497 0.101 0.130 0.088 0.591 0.075 0.371 0.096 0.258 0.077 Ackley 50 0.012 0.011 0.009 0.007 0.024 0.016 0.013 0.006 0.013 0.005 0.023 0.011 0.014 0.012 0.014 0.007 0.070 0.030 0.043 0.038 0.008 0.007 0.017 0.009 0.029 0.014 0.044 0.010 Levy 50 0.464 0.086 0.411 0.145 0.347 0.119 0.393 0.122 0.363 0.143 0.382 0.116 0.241 0.114 0.374 0.161 0.504 0.158 0.419 0.153 0.191 0.078 0.486 0.135 0.450 0.087 0.501 0.102 Powell 50 0.708 0.073 0.732 0.056 0.716 0.077 0.634 0.098 0.694 0.051 0.745 0.076 0.416 0.141 0.711 0.069 0.786 0.079 0.678 0.135 0.292 0.251 0.783 0.055 0.804 0.068 0.697 0.089 Rastrigin 50 0.206 0.090 0.209 0.061 0.132 0.058 0.177 0.060 0.190 0.058 0.170 0.061 0.072 0.039 0.187 0.058 0.277 0.085 0.185 0.071 0.076 0.057 0.162 0.083 0.210 0.083 0.262 0.046 Rosenbrock 50 0.713 0.083 0.751 0.077 0.698 0.117 0.640 0.114 0.705 0.067 0.679 0.094 0.430 0.158 0.768 0.065 0.819 0.076 0.724 0.116 0.257 0.163 0.766 0.057 0.777 0.076 0.684 0.061 Styblinski-Tang 50 0.428 0.060 0.392 0.069 0.384 0.079 0.382 0.048 0.351 0.060 0.234 0.067 0.203 0.074 0.368 0.055 0.287 0.124 0.336 0.057 0.133 0.069 0.414 0.038 0.293 0.054 0.269 0.077 Ackley 100 0.042 0.013 0.035 0.016 0.079 0.021 0.037 0.011 0.060 0.012 0.081 0.021 0.048 0.021 0.051 0.022 0.180 0.041 0.069 0.021 0.005 0.004 0.042 0.011 0.013 0.005 0.028 0.006 Emb. Hartmann 6 100 0.554 0.216 0.577 0.198 0.645 0.225 0.561 0.234 0.413 0.197 0.649 0.112 0.547 0.227 0.566 0.212 0.693 0.125 0.678 0.163 0.411 0.161 0.703 0.155 0.485 0.163 0.765 0.163 Levy 100 0.522 0.069 0.545 0.053 0.598 0.053 0.595 0.050 0.589 0.101 0.680 0.082 0.511 0.101 0.599 0.064 0.796 0.041 0.611 0.092 0.115 0.072 0.600 0.051 0.216 0.063 0.398 0.075 Powell 100 0.660 0.071 0.643 0.103 0.810 0.076 0.657 0.063 0.725 0.057 0.860 0.073 0.626 0.186 0.679 0.085 0.938 0.014 0.673 0.089 0.228 0.194 0.725 0.088 0.303 0.150 0.654 0.107 Rastrigin 100 0.259 0.046 0.256 0.058 0.304 0.032 0.280 0.052 0.290 0.052 0.323 0.028 0.263 0.049 0.290 0.056 0.409 0.029 0.304 0.056 0.060 0.041 0.279 0.035 0.119 0.040 0.188 0.039 Rosenbrock 100 0.610 0.085 0.600 0.096 0.714 0.058 0.556 0.127 0.617 0.136 0.748 0.044 0.528 0.142 0.652 0.091 0.885 0.029 0.643 0.077 0.212 0.108 0.712 0.061 0.364 0.090 0.552 0.102 Styblinski-Tang 100 0.354 0.065 0.356 0.061 0.331 0.043 0.332 0.053 0.318 0.070 0.321 0.054 0.267 0.062 0.331 0.049 0.324 0.047 0.303 0.037 0.095 0.048 0.350 0.042 0.148 0.067 0.275 0.077

Mean 0.514 0.537 0.609 0.553 0.578 0.558 0.525 0.612 0.577 0.605 0.354 0.612 0.533 0.500 Median 0.554 0.570 0.698 0.611 0.617 0.679 0.511 0.652 0.693 0.665 0.261 0.703 0.485 0.509

Table A6: BO on noise-free synthetic test problems. The normalized highest observed value after 10 rounds of BO with q=10 is shown. Colors are normalized row-wise. Higher means better. Results are means over ten replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB GIBBON Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - default -

Ackley 2 0.154 0.307 0.466 0.334 0.913 0.062 0.861 0.119 0.936 0.045 0.939 0.038 0.875 0.059 0.951 0.021 0.869 0.060 0.940 0.048 0.755 0.293 0.909 0.057 0.775 0.082 0.575 0.411 Levy 2 0.908 0.124 0.998 0.001 0.997 0.003 0.999 0.001 0.999 0.001 0.996 0.004 0.998 0.002 0.997 0.004 0.988 0.015 0.997 0.003 0.919 0.083 0.998 0.002 0.985 0.013 0.844 0.154 Rastrigin 2 0.545 0.307 0.684 0.238 0.953 0.038 0.942 0.053 0.932 0.048 0.911 0.034 0.911 0.043 0.936 0.033 0.839 0.096 0.939 0.034 0.731 0.255 0.913 0.054 0.902 0.070 0.726 0.300 Rosenbrock 2 0.976 0.048 0.979 0.040 0.883 0.313 0.992 0.009 0.961 0.100 0.987 0.014 0.955 0.118 0.977 0.035 0.989 0.014 0.972 0.047 0.863 0.312 0.994 0.008 0.837 0.305 0.829 0.303 Styblinski-Tang 2 0.441 0.263 0.983 0.027 0.941 0.065 0.991 0.011 0.991 0.013 0.964 0.052 0.989 0.012 0.998 0.003 0.984 0.020 0.997 0.003 0.836 0.186 0.983 0.021 0.911 0.063 0.266 0.187 Shekel 4 0.287 0.099 0.257 0.071 0.250 0.103 0.245 0.073 0.265 0.096 0.145 0.037 0.151 0.051 0.266 0.064 0.101 0.028 0.200 0.054 0.132 0.049 0.197 0.103 0.136 0.068 0.300 0.131 Hartmann 6 0.968 0.035 0.968 0.034 0.929 0.062 0.966 0.046 0.949 0.046 0.880 0.081 0.917 0.054 0.964 0.037 0.626 0.184 0.956 0.047 0.855 0.025 0.964 0.037 0.887 0.043 0.968 0.028 Cosine 8 0.728 0.100 0.821 0.106 0.775 0.095 0.747 0.079 0.756 0.088 0.679 0.104 0.763 0.059 0.785 0.083 0.817 0.082 0.753 0.058 1.000 0.000 0.743 0.063 0.626 0.048 0.797 0.083 Ackley 10 0.105 0.075 0.505 0.252 0.718 0.088 0.256 0.176 0.525 0.251 0.276 0.113 0.559 0.078 0.731 0.040 0.226 0.094 0.642 0.045 1.000 0.000 0.675 0.062 0.437 0.084 0.648 0.056 Levy 10 0.833 0.093 0.836 0.099 0.950 0.020 0.828 0.155 0.905 0.039 0.866 0.039 0.897 0.062 0.914 0.065 0.927 0.034 0.928 0.043 0.955 0.052 0.864 0.080 0.722 0.130 0.773 0.141 Powell 10 0.961 0.028 0.949 0.037 0.898 0.042 0.933 0.046 0.906 0.090 0.877 0.068 0.927 0.052 0.928 0.055 0.846 0.118 0.940 0.043 0.798 0.139 0.960 0.020 0.741 0.217 0.962 0.020 Rastrigin 10 0.397 0.148 0.277 0.123 0.378 0.101 0.274 0.157 0.348 0.162 0.325 0.159 0.389 0.147 0.418 0.100 0.266 0.151 0.386 0.075 0.759 0.396 0.421 0.069 0.318 0.128 0.449 0.112 Rosenbrock 10 0.985 0.011 0.982 0.014 0.978 0.009 0.979 0.013 0.967 0.023 0.957 0.032 0.948 0.042 0.976 0.017 0.991 0.007 0.979 0.012 0.930 0.050 0.984 0.010 0.913 0.104 0.989 0.013 Styblinski-Tang 10 0.689 0.127 0.706 0.133 0.274 0.182 0.675 0.149 0.455 0.134 0.182 0.153 0.408 0.140 0.597 0.154 0.105 0.088 0.624 0.128 0.220 0.163 0.660 0.168 0.170 0.130 0.585 0.143 Ackley 20 0.032 0.029 0.198 0.290 0.600 0.148 0.058 0.038 0.309 0.233 0.229 0.080 0.397 0.085 0.685 0.062 0.337 0.082 0.582 0.037 0.111 0.313 0.576 0.077 0.365 0.067 0.173 0.106 Levy 20 0.714 0.091 0.656 0.146 0.895 0.047 0.720 0.094 0.864 0.061 0.798 0.054 0.892 0.043 0.831 0.055 0.879 0.060 0.774 0.067 0.291 0.185 0.646 0.086 0.870 0.063 0.682 0.103 Powell 20 0.889 0.098 0.919 0.055 0.899 0.061 0.890 0.072 0.911 0.041 0.897 0.038 0.908 0.074 0.924 0.031 0.848 0.082 0.914 0.051 0.355 0.213 0.931 0.036 0.875 0.073 0.839 0.060 Rastrigin 20 0.243 0.101 0.213 0.115 0.352 0.101 0.278 0.078 0.323 0.082 0.326 0.080 0.425 0.093 0.335 0.079 0.300 0.066 0.353 0.074 0.087 0.075 0.278 0.127 0.330 0.095 0.224 0.122 Rosenbrock 20 0.952 0.016 0.973 0.015 0.989 0.005 0.961 0.020 0.975 0.011 0.945 0.025 0.957 0.020 0.979 0.006 0.974 0.011 0.973 0.010 0.434 0.307 0.979 0.006 0.964 0.021 0.914 0.066 Styblinski-Tang 20 0.626 0.115 0.596 0.093 0.381 0.111 0.601 0.089 0.519 0.120 0.262 0.107 0.272 0.065 0.570 0.108 0.186 0.112 0.562 0.094 0.131 0.101 0.611 0.094 0.348 0.072 0.391 0.080 Ackley 50 0.020 0.014 0.011 0.006 0.035 0.031 0.013 0.012 0.017 0.011 0.038 0.006 0.153 0.161 0.032 0.017 0.342 0.128 0.153 0.088 0.012 0.006 0.019 0.005 0.093 0.065 0.065 0.012 Levy 50 0.473 0.087 0.578 0.169 0.463 0.117 0.490 0.066 0.403 0.113 0.541 0.150 0.647 0.180 0.439 0.111 0.760 0.068 0.609 0.137 0.284 0.257 0.506 0.094 0.518 0.131 0.523 0.082 Powell 50 0.758 0.073 0.785 0.059 0.851 0.037 0.706 0.090 0.777 0.084 0.752 0.102 0.800 0.148 0.834 0.061 0.919 0.025 0.790 0.069 0.367 0.104 0.853 0.036 0.843 0.053 0.766 0.054 Rastrigin 50 0.164 0.042 0.173 0.040 0.200 0.043 0.186 0.035 0.172 0.062 0.186 0.088 0.231 0.129 0.206 0.056 0.306 0.096 0.270 0.078 0.085 0.051 0.157 0.090 0.169 0.069 0.300 0.055 Rosenbrock 50 0.822 0.025 0.834 0.063 0.903 0.034 0.743 0.055 0.789 0.085 0.692 0.074 0.892 0.068 0.875 0.049 0.949 0.017 0.882 0.033 0.423 0.119 0.861 0.034 0.893 0.021 0.787 0.090 Styblinski-Tang 50 0.449 0.032 0.454 0.061 0.434 0.067 0.444 0.069 0.427 0.047 0.325 0.067 0.263 0.083 0.455 0.035 0.285 0.060 0.432 0.047 0.167 0.215 0.529 0.039 0.426 0.059 0.287 0.094 Ackley 100 0.050 0.022 0.050 0.014 0.180 0.075 0.039 0.010 0.044 0.020 0.134 0.034 0.096 0.042 0.066 0.018 0.350 0.090 0.145 0.069 0.008 0.004 0.052 0.016 0.016 0.006 0.044 0.011 Emb. Hartmann 6 100 0.720 0.133 0.873 0.102 0.868 0.083 0.732 0.166 0.745 0.186 0.735 0.168 0.688 0.251 0.825 0.153 0.845 0.132 0.692 0.258 0.463 0.194 0.845 0.163 0.536 0.231 0.845 0.087 Levy 100 0.532 0.132 0.677 0.066 0.722 0.111 0.633 0.040 0.608 0.045 0.745 0.063 0.595 0.152 0.676 0.066 0.855 0.027 0.664 0.097 0.143 0.051 0.691 0.052 0.284 0.062 0.467 0.048 Powell 100 0.699 0.102 0.772 0.063 0.898 0.050 0.701 0.114 0.730 0.081 0.880 0.049 0.717 0.144 0.833 0.051 0.959 0.012 0.805 0.047 0.319 0.137 0.835 0.020 0.443 0.146 0.741 0.060 Rastrigin 100 0.270 0.043 0.357 0.062 0.351 0.032 0.306 0.029 0.320 0.065 0.352 0.035 0.271 0.075 0.336 0.037 0.412 0.031 0.352 0.059 0.054 0.038 0.320 0.028 0.111 0.032 0.271 0.041 Rosenbrock 100 0.701 0.113 0.746 0.058 0.883 0.041 0.679 0.072 0.708 0.084 0.830 0.074 0.668 0.093 0.789 0.060 0.927 0.025 0.742 0.052 0.218 0.152 0.797 0.028 0.383 0.128 0.614 0.046 Styblinski-Tang 100 0.402 0.041 0.364 0.021 0.355 0.047 0.361 0.052 0.347 0.067 0.315 0.037 0.309 0.031 0.361 0.047 0.330 0.054 0.374 0.046 0.106 0.045 0.416 0.045 0.163 0.060 0.305 0.063

Mean 0.560 0.625 0.670 0.613 0.633 0.605 0.632 0.681 0.647 0.676 0.449 0.672 0.545 0.574 Median 0.626 0.684 0.851 0.701 0.730 0.735 0.688 0.789 0.839 0.742 0.355 0.743 0.518 0.614

Table A7: BO on noise-free synthetic test problems. The relative batch instantaneous regret of the last, exploitative batch with q=5 is shown. Colors are normalized row-wise. Lower means better. Results are means over ten replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB GIBBON Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - default -

Ackley 2 1.003 0.296 0.916 0.306 0.166 0.097 0.372 0.233 0.336 0.149 0.229 0.091 0.867 0.083 0.836 0.101 0.791 0.130 0.838 0.207 0.631 0.269 0.963 0.119 0.933 0.080 0.622 0.346 Levy 2 0.288 0.220 0.238 0.371 0.037 0.039 0.034 0.051 0.021 0.021 0.027 0.040 1.217 0.686 0.856 0.832 1.212 1.060 0.626 0.535 1.300 2.236 0.255 0.446 1.409 0.605 0.219 0.232 Rastrigin 2 0.951 0.343 0.819 0.520 0.396 0.325 0.579 0.320 0.480 0.401 0.487 0.220 1.027 0.344 0.913 0.151 0.988 0.188 0.768 0.268 0.806 0.400 0.958 0.329 1.293 0.372 0.612 0.319 Rosenbrock 2 0.009 0.009 0.018 0.032 0.006 0.007 0.117 0.347 0.003 0.006 0.023 0.027 6.307 14.676 2.837 3.999 3.306 4.617 1.373 2.340 0.162 0.163 0.061 0.119 8.714 13.216 0.005 0.013 Styblinski-Tang 2 0.446 0.490 0.478 0.463 0.354 0.456 0.288 0.265 0.202 0.125 0.306 0.389 0.974 0.852 0.994 0.561 0.954 0.721 1.003 0.785 0.987 0.567 0.779 0.702 1.047 0.362 0.298 0.195 Shekel 4 0.892 0.082 0.817 0.064 0.816 0.077 0.902 0.055 0.879 0.077 0.894 0.069 0.970 0.020 0.938 0.018 0.964 0.019 0.949 0.063 0.952 0.051 0.958 0.079 0.984 0.037 0.848 0.050 Hartmann 6 0.157 0.179 0.089 0.088 0.110 0.093 0.106 0.118 0.135 0.154 0.297 0.264 0.749 0.128 0.781 0.104 0.794 0.161 0.634 0.222 0.467 0.228 0.568 0.489 0.747 0.277 0.175 0.108 Cosine 8 0.425 0.207 0.240 0.079 0.117 0.045 0.301 0.111 0.233 0.066 0.283 0.040 0.719 0.203 0.728 0.212 0.742 0.205 0.454 0.261 0.562 0.139 0.464 0.199 0.994 0.781 0.399 0.120 Ackley 10 0.942 0.078 0.732 0.296 0.456 0.162 0.879 0.092 0.586 0.220 0.695 0.159 0.917 0.052 0.711 0.091 0.893 0.072 0.784 0.180 0.899 0.113 0.672 0.180 0.824 0.113 0.760 0.108 Levy 10 0.420 0.226 0.325 0.185 0.053 0.045 0.273 0.132 0.135 0.060 0.117 0.083 0.715 0.489 0.985 0.500 0.911 0.554 0.434 0.521 0.610 0.269 0.337 0.279 0.720 0.741 0.408 0.249 Powell 10 0.008 0.004 0.009 0.006 0.014 0.013 0.012 0.010 0.022 0.012 0.045 0.047 0.583 0.519 0.748 0.479 0.757 0.848 0.080 0.069 0.535 0.308 0.135 0.258 0.288 0.327 0.026 0.019 Rastrigin 10 0.644 0.111 0.666 0.171 0.583 0.102 0.708 0.092 0.658 0.141 0.648 0.049 0.838 0.081 0.864 0.110 0.881 0.097 0.754 0.105 0.941 0.118 0.755 0.237 0.792 0.131 0.729 0.095 Rosenbrock 10 0.009 0.006 0.004 0.003 0.003 0.002 0.009 0.010 0.014 0.014 0.023 0.021 0.525 0.217 0.614 0.505 0.439 0.279 0.058 0.053 0.451 0.217 0.052 0.047 0.211 0.118 0.043 0.030 Styblinski-Tang 10 0.319 0.220 0.230 0.099 0.522 0.121 0.266 0.117 0.304 0.107 0.601 0.178 0.866 0.242 0.879 0.268 1.208 0.312 1.015 0.783 0.953 0.171 1.043 0.977 0.724 0.222 0.475 0.190 Ackley 20 0.967 0.033 0.946 0.082 0.634 0.183 0.930 0.079 0.807 0.185 0.834 0.093 0.881 0.095 0.906 0.078 0.914 0.046 0.820 0.142 0.981 0.069 0.854 0.125 0.856 0.115 0.965 0.016 Levy 20 0.475 0.205 0.377 0.194 0.159 0.092 0.313 0.220 0.245 0.067 0.200 0.111 0.824 0.475 0.858 0.201 0.669 0.316 0.341 0.103 0.895 0.112 0.315 0.139 0.345 0.131 0.588 0.139 Powell 20 0.053 0.029 0.043 0.021 0.039 0.029 0.054 0.039 0.053 0.031 0.075 0.059 0.758 0.526 1.041 0.665 0.728 0.337 0.058 0.025 1.196 0.463 0.052 0.028 0.111 0.028 0.294 0.235 Rastrigin 20 0.781 0.187 0.738 0.201 0.825 0.122 0.793 0.163 0.726 0.092 0.722 0.158 0.880 0.161 0.970 0.097 0.878 0.156 0.898 0.140 1.044 0.074 0.827 0.142 0.774 0.063 0.872 0.084 Rosenbrock 20 0.037 0.019 0.024 0.015 0.016 0.009 0.036 0.017 0.031 0.020 0.039 0.021 0.521 0.320 0.776 0.334 0.402 0.350 0.043 0.028 0.984 0.206 0.027 0.014 0.446 1.091 0.147 0.089 Styblinski-Tang 20 0.694 0.669 0.574 0.508 0.424 0.093 0.356 0.096 0.380 0.055 0.532 0.090 0.866 0.173 0.897 0.178 0.932 0.234 0.481 0.071 0.990 0.102 1.028 1.281 0.619 0.177 0.741 0.128 Ackley 50 0.989 0.013 0.991 0.006 0.974 0.018 0.989 0.009 0.989 0.004 0.979 0.014 0.998 0.004 0.998 0.004 0.968 0.024 0.967 0.042 1.002 0.004 0.984 0.007 0.972 0.013 0.966 0.009 Levy 50 1.403 1.462 1.022 1.053 0.558 0.117 0.583 0.083 0.637 0.216 0.549 0.100 1.220 0.283 1.253 0.393 0.852 0.184 0.567 0.154 1.072 0.144 0.499 0.129 0.603 0.114 0.559 0.095 Powell 50 0.151 0.026 0.141 0.031 0.150 0.038 0.194 0.055 0.173 0.035 0.151 0.039 1.274 0.495 1.198 0.589 0.798 0.283 0.317 0.317 1.052 0.184 0.126 0.036 0.140 0.047 0.303 0.089 Rastrigin 50 0.913 0.306 0.846 0.139 0.871 0.066 0.823 0.089 0.794 0.076 0.817 0.051 1.039 0.042 0.960 0.040 0.919 0.065 0.900 0.085 1.008 0.042 0.882 0.077 0.839 0.081 0.777 0.061 Rosenbrock 50 0.223 0.085 0.193 0.077 0.236 0.110 0.284 0.102 0.227 0.063 0.257 0.069 1.167 0.286 0.977 0.376 0.697 0.180 0.295 0.197 1.074 0.140 0.198 0.053 0.234 0.075 0.395 0.076 Styblinski-Tang 50 0.499 0.058 0.617 0.278 0.537 0.045 0.540 0.046 0.570 0.062 0.677 0.069 1.178 0.251 1.321 0.360 0.917 0.178 0.664 0.179 0.996 0.064 0.542 0.069 0.726 0.133 0.745 0.061 Ackley 100 0.959 0.016 0.965 0.016 0.920 0.022 0.962 0.013 0.938 0.012 0.915 0.021 0.989 0.007 0.989 0.009 0.957 0.016 0.938 0.022 1.001 0.004 0.961 0.009 0.994 0.005 0.972 0.004 Emb. Hartmann 6 100 0.430 0.214 0.371 0.159 0.348 0.242 0.389 0.209 0.532 0.173 0.362 0.128 0.887 0.071 0.860 0.081 0.851 0.044 0.474 0.231 0.909 0.086 0.300 0.162 0.960 0.084 0.390 0.135 Levy 100 0.465 0.100 0.397 0.052 0.363 0.051 0.345 0.040 0.354 0.087 0.301 0.077 1.039 0.208 0.952 0.070 0.853 0.024 0.362 0.096 1.024 0.063 0.362 0.045 0.882 0.094 0.588 0.049 Powell 100 0.231 0.064 0.233 0.066 0.139 0.049 0.224 0.046 0.181 0.036 0.099 0.048 0.886 0.129 1.024 0.252 0.760 0.107 0.516 0.903 0.982 0.128 0.195 0.071 0.769 0.070 0.306 0.053 Rastrigin 100 0.709 0.048 0.706 0.072 0.686 0.040 0.710 0.095 0.679 0.058 0.671 0.037 0.939 0.030 0.922 0.030 0.924 0.028 0.719 0.068 1.006 0.037 0.706 0.057 0.943 0.030 0.790 0.021 Rosenbrock 100 0.323 0.064 0.328 0.052 0.248 0.048 0.366 0.084 0.316 0.082 0.227 0.051 0.968 0.152 0.935 0.113 0.830 0.078 0.378 0.172 1.027 0.062 0.255 0.055 0.818 0.091 0.443 0.061 Styblinski-Tang 100 0.577 0.055 0.581 0.060 0.607 0.042 0.612 0.054 0.625 0.043 0.615 0.045 0.987 0.114 0.941 0.093 0.916 0.051 0.649 0.058 1.000 0.050 0.592 0.050 0.912 0.091 0.704 0.060

Mean 0.527 0.475 0.375 0.435 0.402 0.415 1.075 0.987 0.927 0.611 0.894 0.536 0.989 0.520 Median 0.465 0.397 0.354 0.356 0.336 0.306 0.917 0.939 0.881 0.634 0.984 0.542 0.818 0.559

Table A8: BO on noise-free synthetic test problems. The relative batch instantaneous regret of the last, exploitative batch with q=10 is shown. Colors are normalized row-wise. Lower means better. Results are means over ten replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB GIBBON Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - default -

Ackley 2 0.935 0.250 0.582 0.306 0.213 0.092 0.333 0.167 0.221 0.113 0.181 0.081 0.907 0.036 0.887 0.061 0.885 0.053 0.749 0.093 0.622 0.333 0.846 0.240 0.958 0.029 0.530 0.402 Levy 2 0.157 0.099 0.060 0.075 0.063 0.052 0.033 0.027 0.087 0.039 0.077 0.067 1.166 0.735 1.137 1.309 1.049 0.691 0.638 0.350 0.407 0.453 0.073 0.113 1.996 1.164 0.138 0.148 Rastrigin 2 0.631 0.269 0.676 0.294 0.404 0.209 0.566 0.189 0.482 0.399 0.662 0.296 0.994 0.255 0.927 0.213 1.016 0.234 0.608 0.109 0.831 0.427 0.708 0.121 1.217 0.214 0.292 0.299 Rosenbrock 2 0.002 0.002 0.002 0.004 0.005 0.004 0.003 0.003 0.002 0.002 0.007 0.012 1.054 0.835 1.335 1.611 2.055 4.821 0.471 0.644 0.116 0.228 0.009 0.011 2.095 3.809 0.001 0.001 Styblinski-Tang 2 0.233 0.080 0.124 0.071 0.371 0.396 0.190 0.181 0.290 0.353 0.234 0.259 1.129 0.298 1.104 0.304 0.861 0.295 0.930 0.436 0.359 0.312 0.294 0.400 1.273 0.469 0.261 0.105 Shekel 4 0.825 0.067 0.827 0.039 0.783 0.095 0.862 0.050 0.827 0.078 0.871 0.039 0.983 0.014 0.965 0.015 0.978 0.012 0.936 0.031 0.968 0.034 1.007 0.013 0.990 0.012 0.720 0.138 Hartmann 6 0.189 0.110 0.118 0.070 0.106 0.096 0.059 0.080 0.061 0.064 0.126 0.098 0.861 0.097 0.853 0.079 0.775 0.092 0.834 0.169 0.347 0.157 0.748 0.293 0.876 0.085 0.039 0.027 Cosine 8 0.217 0.084 0.159 0.100 0.167 0.073 0.213 0.066 0.188 0.064 0.270 0.092 0.831 0.118 0.917 0.166 0.797 0.215 1.055 0.570 0.570 0.194 0.793 0.604 1.003 0.159 0.200 0.067 Ackley 10 0.866 0.093 0.548 0.223 0.277 0.090 0.734 0.162 0.531 0.226 0.707 0.123 0.833 0.061 0.842 0.063 0.924 0.044 0.891 0.102 0.859 0.026 0.965 0.095 0.972 0.018 0.445 0.082 Levy 10 0.196 0.167 0.143 0.123 0.024 0.010 0.123 0.098 0.086 0.055 0.118 0.064 0.936 0.297 0.845 0.284 0.866 0.284 0.504 0.386 0.591 0.168 0.305 0.230 0.747 0.286 0.139 0.060 Powell 10 0.012 0.016 0.017 0.015 0.018 0.017 0.027 0.026 0.032 0.017 0.056 0.033 0.885 0.507 0.948 0.604 0.985 0.461 0.213 0.204 0.682 0.268 0.322 0.709 0.442 0.327 0.009 0.007 Rastrigin 10 0.773 0.187 0.775 0.116 0.615 0.122 0.714 0.177 0.593 0.117 0.598 0.169 0.869 0.097 0.909 0.084 0.856 0.128 0.719 0.042 0.828 0.090 0.738 0.149 0.791 0.067 0.423 0.073 Rosenbrock 10 0.008 0.008 0.006 0.005 0.004 0.002 0.011 0.008 0.013 0.007 0.034 0.017 0.550 0.287 0.684 0.259 0.744 0.178 0.155 0.115 0.303 0.136 0.087 0.072 0.633 0.537 0.005 0.006 Styblinski-Tang 10 0.276 0.155 0.389 0.329 0.523 0.135 0.257 0.086 0.419 0.077 0.545 0.070 1.036 0.164 1.079 0.255 1.074 0.198 1.060 0.498 0.907 0.125 1.064 0.761 0.926 0.186 0.256 0.084 Ackley 20 0.957 0.040 0.858 0.204 0.392 0.146 0.923 0.048 0.674 0.233 0.757 0.071 0.850 0.045 0.822 0.088 0.899 0.044 0.778 0.141 0.984 0.034 0.492 0.078 0.978 0.012 0.871 0.062 Levy 20 0.232 0.066 0.685 0.872 0.065 0.033 0.262 0.209 0.099 0.039 0.135 0.034 0.683 0.291 0.970 0.274 0.794 0.390 0.348 0.134 0.950 0.119 0.425 0.183 0.431 0.394 0.327 0.102 Powell 20 0.078 0.144 0.025 0.017 0.029 0.022 0.038 0.027 0.030 0.012 0.036 0.011 0.568 0.241 0.748 0.472 0.603 0.397 0.082 0.039 0.937 0.209 0.035 0.014 0.126 0.054 0.098 0.057 Rastrigin 20 0.739 0.102 0.730 0.076 0.707 0.074 0.732 0.101 0.660 0.102 0.660 0.099 0.843 0.119 0.865 0.055 0.824 0.059 0.712 0.035 0.999 0.053 0.747 0.093 0.724 0.061 0.783 0.081 Rosenbrock 20 0.019 0.007 0.013 0.008 0.004 0.002 0.018 0.009 0.011 0.005 0.031 0.030 0.461 0.335 0.784 0.288 0.410 0.200 0.036 0.008 0.788 0.119 0.018 0.007 0.095 0.039 0.074 0.045 Styblinski-Tang 20 0.268 0.074 0.288 0.069 0.445 0.058 0.311 0.082 0.410 0.117 0.679 0.152 1.104 0.212 1.104 0.216 1.080 0.296 0.628 0.224 0.985 0.078 0.852 1.063 0.859 0.657 0.523 0.131 Ackley 50 0.981 0.018 0.990 0.005 0.960 0.034 0.988 0.015 0.987 0.009 0.961 0.011 0.960 0.046 0.994 0.007 0.862 0.058 0.893 0.085 1.001 0.005 0.982 0.005 0.934 0.064 0.949 0.010 Levy 50 0.464 0.115 0.965 1.096 0.445 0.107 0.482 0.119 0.489 0.120 0.417 0.127 1.023 0.305 1.373 0.357 0.611 0.174 0.516 0.309 1.032 0.194 1.034 1.514 0.554 0.194 0.487 0.055 Powell 50 0.127 0.030 0.109 0.018 0.076 0.032 0.147 0.031 0.113 0.037 0.177 0.076 1.177 0.598 1.302 0.455 0.522 0.332 0.269 0.418 0.972 0.192 0.082 0.031 0.398 0.913 0.218 0.062 Rastrigin 50 0.889 0.124 0.858 0.050 0.764 0.111 0.804 0.065 0.789 0.045 0.776 0.085 0.943 0.093 0.940 0.037 0.829 0.063 0.826 0.133 0.997 0.046 1.000 0.323 0.832 0.041 0.696 0.035 Rosenbrock 50 0.131 0.030 0.123 0.054 0.075 0.035 0.197 0.043 0.159 0.073 0.249 0.103 0.784 0.390 1.066 0.234 0.417 0.180 0.107 0.023 1.039 0.137 0.117 0.042 0.135 0.056 0.285 0.085 Styblinski-Tang 50 0.450 0.039 0.500 0.148 0.534 0.238 0.467 0.089 0.487 0.068 0.593 0.053 1.050 0.210 1.313 0.213 0.797 0.117 0.520 0.073 0.990 0.055 0.633 0.631 0.804 0.821 0.697 0.068 Ackley 100 0.950 0.023 0.948 0.016 0.815 0.075 0.961 0.010 0.954 0.019 0.861 0.033 0.968 0.022 0.992 0.010 0.914 0.056 0.905 0.057 0.999 0.002 0.949 0.015 0.992 0.002 0.959 0.007 Emb. Hartmann 6 100 0.265 0.179 0.118 0.092 0.107 0.085 0.220 0.151 0.199 0.155 0.208 0.125 0.784 0.109 0.815 0.127 0.756 0.133 0.543 0.163 0.870 0.048 0.157 0.175 0.941 0.065 0.301 0.120 Levy 100 0.402 0.108 0.291 0.069 0.242 0.083 0.328 0.058 0.351 0.077 0.247 0.053 1.017 0.196 1.154 0.164 0.976 0.179 0.442 0.386 0.999 0.040 0.318 0.120 0.821 0.054 0.515 0.037 Powell 100 0.189 0.063 0.141 0.031 0.066 0.024 0.190 0.067 0.166 0.047 0.083 0.027 0.865 0.237 1.232 0.274 0.861 0.315 0.141 0.029 0.916 0.092 0.112 0.020 0.727 0.063 0.226 0.039 Rastrigin 100 0.726 0.068 0.702 0.077 0.628 0.042 0.709 0.084 0.666 0.085 0.636 0.046 0.946 0.057 0.946 0.051 0.806 0.090 0.788 0.119 1.002 0.027 0.730 0.058 0.927 0.024 0.715 0.037 Rosenbrock 100 0.227 0.076 0.200 0.059 0.093 0.040 0.255 0.080 0.229 0.080 0.153 0.080 0.880 0.167 1.060 0.106 0.823 0.223 0.225 0.047 0.988 0.070 0.164 0.016 0.766 0.057 0.366 0.047 Styblinski-Tang 100 0.527 0.052 0.568 0.036 0.598 0.054 0.566 0.044 0.581 0.036 0.625 0.029 1.051 0.145 1.267 0.286 0.965 0.017 0.667 0.271 0.987 0.039 0.522 0.020 0.898 0.033 0.677 0.041

Mean 0.422 0.410 0.322 0.386 0.360 0.387 0.909 1.005 0.867 0.581 0.813 0.525 0.844 0.401 Median 0.268 0.291 0.242 0.262 0.290 0.249 0.936 0.965 0.861 0.628 0.937 0.522 0.859 0.327

D.3 Control problems

Figure A2: Experiments on the 14D robot arm pushing and 60D rover trajectory planning control problems. 10 replicates each. GIBBON (s) refers to the scaled larged-batch variant of GIBBON.

D.4 Run time

Figure A3: Example run times for the 10-round BO experiment on the 6D Hartmann problem with Q=100. Error bars are over 5 replicate runs. Run times vary depending on the test problem, with GIBBON appearing especially sensitive, becoming e.g. 10x slower on the 50D Ackley problem.

Table A9: Total run times for five replicates of the experiments presented in Table A1. We sum over all test problems.

Method Configuration Total time [h]

mean BEEBO T = 0.05 66.12 mean BEEBO T = 0.5 47.08 mean BEEBO T = 5.0 37.13 max BEEBO T = 0.05 54.85 max BEEBO T = 0.5 44.20 max BEEBO T = 5.0 47.63 q-UCB κ = 0.1 3.33 q-UCB κ = 1.0 3.70 q-UCB κ = 10.0 4.41 q-EI - 24.33 TS - 6.56 KB - 223.78 GIBBON default 3380.48 GIBBON scaled 1055.93

D.5 Results with random initialization in round 0

Table A10: BO with random initialization on noise-free synthetic test problems. The normalized highest observed value after 10 rounds of BO with q=100 is shown. Colors are normalized row-wise. Higher means better. Results are means over five replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - -

Ackley 2 0.994 0.011 0.971 0.041 0.988 0.006 0.985 0.012 0.977 0.029 0.912 0.069 0.793 0.255 0.904 0.059 0.968 0.019 0.957 0.056 1.000 0.000 0.928 0.076 0.972 0.027 Levy 2 0.995 0.008 0.997 0.002 0.995 0.005 0.997 0.002 0.998 0.002 0.966 0.044 0.983 0.023 0.985 0.010 0.954 0.047 0.994 0.004 0.993 0.004 0.997 0.003 0.992 0.017 Rastrigin 2 0.838 0.190 0.674 0.433 0.836 0.138 0.784 0.350 0.756 0.356 0.847 0.199 0.540 0.359 0.675 0.250 0.296 0.255 0.873 0.173 0.800 0.447 0.691 0.299 0.902 0.220 Rosenbrock 2 0.893 0.080 0.900 0.103 0.605 0.395 0.525 0.496 0.705 0.413 0.242 0.306 0.678 0.394 0.475 0.452 0.469 0.485 0.728 0.304 0.888 0.249 0.753 0.423 0.875 0.266 Styblinski-Tang 2 1.000 0.001 1.000 0.000 1.000 0.000 0.998 0.002 0.999 0.002 0.999 0.002 0.998 0.001 0.999 0.002 0.993 0.008 0.999 0.001 0.998 0.002 1.000 0.000 1.000 0.000 Shekel 4 0.839 0.311 0.829 0.360 0.706 0.326 0.390 0.353 0.476 0.274 0.388 0.316 0.183 0.068 0.376 0.308 0.266 0.073 0.530 0.350 0.087 0.060 0.178 0.071 0.824 0.247 Hartmann 6 1.000 0.000 1.000 0.000 0.984 0.016 0.955 0.063 0.989 0.024 0.992 0.010 0.960 0.052 0.998 0.002 0.919 0.057 0.991 0.020 0.715 0.150 0.998 0.001 0.965 0.045 Cosine 8 1.000 0.000 0.997 0.002 0.377 0.144 0.999 0.000 0.968 0.023 0.871 0.057 0.906 0.079 0.903 0.050 0.390 0.242 0.787 0.118 1.000 0.000 0.987 0.009 0.912 0.066 Ackley 10 0.935 0.026 0.904 0.039 0.820 0.037 0.816 0.037 0.742 0.052 0.507 0.137 0.790 0.015 0.776 0.038 0.561 0.170 0.782 0.036 1.000 0.000 0.789 0.023 0.784 0.026 Levy 10 0.981 0.013 0.957 0.028 0.930 0.034 0.956 0.026 0.926 0.024 0.919 0.022 0.866 0.036 0.839 0.068 0.813 0.126 0.920 0.064 0.942 0.026 0.949 0.041 0.941 0.072 Powell 10 0.971 0.029 0.953 0.030 0.672 0.387 0.923 0.109 0.900 0.103 0.702 0.398 0.886 0.061 0.822 0.161 0.219 0.302 0.851 0.191 0.833 0.169 0.939 0.058 0.985 0.008 Rastrigin 10 0.465 0.144 0.495 0.180 0.570 0.091 0.526 0.087 0.516 0.140 0.642 0.083 0.394 0.143 0.564 0.152 0.193 0.172 0.441 0.065 1.000 0.000 0.394 0.088 0.672 0.165 Rosenbrock 10 0.992 0.006 0.990 0.002 0.865 0.068 0.990 0.003 0.986 0.007 0.966 0.037 0.965 0.043 0.952 0.019 0.220 0.373 0.975 0.017 0.820 0.029 0.992 0.003 0.993 0.004 Styblinski-Tang 10 0.815 0.051 0.814 0.062 0.245 0.050 0.784 0.025 0.643 0.124 0.217 0.157 0.165 0.115 0.584 0.177 0.028 0.062 0.619 0.093 0.399 0.163 0.827 0.066 0.654 0.085 Robot Pushing 14 0.350 0.121 0.377 0.124 0.560 0.172 0.425 0.107 0.310 0.140 0.395 0.131 0.424 0.154 0.522 0.170 0.379 0.093 0.417 0.160 0.247 0.118 0.694 0.255 0.518 0.149 Ackley 20 0.843 0.016 0.857 0.028 0.789 0.048 0.819 0.030 0.788 0.040 0.390 0.106 0.706 0.063 0.775 0.039 0.460 0.052 0.740 0.054 1.000 0.000 0.763 0.048 0.438 0.103 Levy 20 0.936 0.035 0.939 0.019 0.896 0.044 0.953 0.027 0.901 0.046 0.911 0.038 0.928 0.016 0.929 0.029 0.768 0.048 0.921 0.024 0.979 0.003 0.956 0.013 0.925 0.041 Powell 20 0.947 0.036 0.966 0.013 0.840 0.111 0.936 0.019 0.880 0.100 0.908 0.076 0.946 0.007 0.928 0.050 0.819 0.122 0.926 0.047 0.964 0.014 0.969 0.016 0.957 0.036 Rastrigin 20 0.373 0.042 0.462 0.049 0.518 0.054 0.514 0.059 0.480 0.077 0.491 0.087 0.450 0.115 0.463 0.069 0.383 0.094 0.451 0.053 1.000 0.000 0.481 0.068 0.523 0.064 Rosenbrock 20 0.992 0.004 0.993 0.004 0.920 0.041 0.991 0.005 0.982 0.013 0.923 0.054 0.967 0.018 0.984 0.005 0.915 0.044 0.984 0.007 0.939 0.020 0.994 0.003 0.993 0.002 Styblinski-Tang 20 0.706 0.061 0.669 0.113 0.305 0.198 0.607 0.086 0.417 0.122 0.279 0.077 0.204 0.130 0.524 0.184 0.054 0.074 0.639 0.082 0.271 0.262 0.665 0.129 0.604 0.088 Ackley 50 0.221 0.293 0.146 0.116 0.842 0.010 0.622 0.243 0.705 0.042 0.457 0.098 0.627 0.053 0.739 0.034 0.736 0.020 0.727 0.051 1.000 0.000 0.683 0.122 0.175 0.022 Levy 50 0.976 0.010 0.978 0.012 0.943 0.021 0.977 0.012 0.955 0.018 0.867 0.013 0.952 0.016 0.966 0.025 0.933 0.007 0.943 0.017 0.987 0.002 0.926 0.041 0.793 0.054 Powell 50 0.940 0.036 0.978 0.010 0.959 0.025 0.976 0.009 0.970 0.014 0.929 0.024 0.965 0.014 0.958 0.016 0.978 0.007 0.964 0.013 0.985 0.004 0.957 0.022 0.920 0.039 Rastrigin 50 0.273 0.156 0.505 0.031 0.453 0.040 0.473 0.042 0.466 0.016 0.445 0.027 0.466 0.073 0.418 0.023 0.503 0.047 0.468 0.012 1.000 0.000 0.423 0.085 0.459 0.070 Rosenbrock 50 0.976 0.012 0.985 0.003 0.988 0.005 0.978 0.011 0.981 0.005 0.968 0.016 0.974 0.004 0.987 0.003 0.984 0.013 0.981 0.003 0.979 0.003 0.987 0.003 0.968 0.018 Styblinski-Tang 50 0.605 0.067 0.693 0.038 0.417 0.163 0.536 0.072 0.415 0.055 0.332 0.037 0.341 0.082 0.716 0.031 0.371 0.040 0.690 0.069 0.254 0.220 0.720 0.041 0.499 0.073 Rover trajectory 60 0.448 0.209 0.678 0.029 0.708 0.060 0.533 0.066 0.657 0.104 0.629 0.040 0.626 0.076 0.613 0.070 0.665 0.079 0.635 0.074 0.265 0.069 0.616 0.043 0.764 0.074 Ackley 100 0.310 0.408 0.347 0.452 0.864 0.023 0.526 0.335 0.536 0.293 0.707 0.086 0.696 0.015 0.721 0.080 0.850 0.007 0.747 0.071 0.007 0.005 0.289 0.081 0.110 0.014 Emb. Hartmann 6 100 0.980 0.009 0.988 0.008 0.916 0.101 0.982 0.016 0.933 0.057 0.914 0.051 0.941 0.035 0.915 0.038 0.913 0.116 0.922 0.110 0.554 0.315 0.949 0.065 0.931 0.084 Levy 100 0.890 0.150 0.966 0.024 0.943 0.019 0.962 0.012 0.942 0.017 0.946 0.013 0.952 0.005 0.937 0.029 0.964 0.014 0.943 0.030 0.310 0.382 0.908 0.056 0.692 0.013 Powell 100 0.786 0.051 0.929 0.099 0.985 0.004 0.985 0.002 0.981 0.009 0.981 0.003 0.983 0.004 0.978 0.013 0.983 0.008 0.963 0.018 0.288 0.165 0.967 0.021 0.860 0.027 Rastrigin 100 0.522 0.027 0.367 0.194 0.467 0.027 0.481 0.057 0.479 0.007 0.469 0.037 0.442 0.021 0.442 0.047 0.432 0.020 0.493 0.014 0.238 0.426 0.674 0.126 0.394 0.022 Rosenbrock 100 0.810 0.029 0.972 0.012 0.976 0.008 0.928 0.119 0.978 0.006 0.975 0.008 0.977 0.008 0.968 0.012 0.985 0.008 0.974 0.003 0.270 0.406 0.943 0.058 0.857 0.010 Styblinski-Tang 100 0.564 0.034 0.470 0.152 0.396 0.079 0.432 0.043 0.331 0.061 0.309 0.022 0.321 0.027 0.542 0.130 0.280 0.017 0.591 0.023 0.198 0.273 0.593 0.063 0.412 0.034

Mean 0.776 0.793 0.751 0.779 0.762 0.697 0.714 0.768 0.618 0.788 0.692 0.788 0.750 Median 0.890 0.929 0.840 0.923 0.880 0.847 0.793 0.822 0.665 0.851 0.888 0.908 0.857

Table A11: BO with random initialization on noise-free synthetic test problems. The relative batch instantaneous regret of the last, exploitative batch is shown. Colors are normalized row-wise. Lower means better. Results are means over five replicate runs.

Problem d mean BEEBO max BEEBO q-UCB q-EI TS KB Tu RBO

T =0.05 T =0.5 T =5.0 T =0.05 T =0.5 T =5.0 κ=0.1 κ=1.0 κ=10.0 - - - -

Ackley 2 0.268 0.132 0.189 0.049 0.334 0.082 0.259 0.187 0.221 0.145 0.299 0.151 1.011 0.027 0.993 0.030 0.994 0.018 0.806 0.209 0.624 0.361 0.749 0.266 0.145 0.101 Levy 2 0.153 0.024 0.130 0.055 0.066 0.059 0.111 0.010 0.091 0.034 0.109 0.009 1.260 0.454 1.195 0.283 1.256 0.263 1.219 0.204 0.280 0.401 0.088 0.100 0.000 0.000 Rastrigin 2 0.427 0.019 0.600 0.381 0.523 0.228 0.306 0.210 0.543 0.073 0.491 0.053 1.009 0.061 0.991 0.104 1.047 0.058 0.808 0.060 0.728 0.196 0.851 0.103 0.031 0.063 Rosenbrock 2 0.001 0.000 0.001 0.000 0.002 0.001 0.003 0.002 0.002 0.000 0.003 0.003 0.895 0.134 0.898 0.131 0.917 0.264 1.101 0.303 0.002 0.001 0.003 0.004 0.000 0.000 Styblinski-Tang 2 0.173 0.007 0.170 0.009 0.170 0.008 0.169 0.007 0.171 0.008 0.170 0.008 1.118 0.087 1.046 0.080 1.047 0.097 0.751 0.154 0.471 0.591 0.169 0.320 0.000 0.000 Shekel 4 0.790 0.049 0.635 0.094 0.707 0.047 0.757 0.097 0.644 0.229 0.727 0.096 0.992 0.006 0.989 0.006 0.992 0.004 0.959 0.041 0.945 0.033 1.001 0.011 0.387 0.223 Hartmann 6 0.052 0.017 0.087 0.030 0.096 0.012 0.189 0.119 0.085 0.029 0.065 0.029 0.959 0.075 0.971 0.017 0.851 0.087 0.863 0.067 0.356 0.006 0.288 0.171 0.028 0.031 Cosine 8 0.060 0.119 0.004 0.006 0.304 0.037 0.000 0.000 0.015 0.010 0.062 0.033 0.987 0.097 0.971 0.071 0.966 0.068 1.111 0.214 0.436 0.018 1.217 0.099 0.080 0.053 Ackley 10 0.447 0.082 0.329 0.074 0.250 0.035 0.324 0.037 0.321 0.047 0.485 0.075 0.936 0.025 0.930 0.021 0.949 0.020 0.937 0.038 0.983 0.015 0.903 0.260 0.299 0.004 Levy 10 0.079 0.068 0.025 0.018 0.296 0.067 0.037 0.031 0.048 0.032 0.093 0.078 1.324 0.110 0.979 0.112 1.088 0.175 0.737 0.322 0.595 0.100 0.552 0.407 0.024 0.011 Powell 10 0.019 0.028 0.009 0.001 0.051 0.004 0.026 0.012 0.052 0.008 0.156 0.056 1.045 0.237 0.926 0.175 1.248 0.273 0.262 0.173 0.144 0.029 0.046 0.038 0.003 0.003 Rastrigin 10 0.625 0.076 0.550 0.082 0.599 0.129 0.533 0.133 0.524 0.119 0.420 0.151 0.911 0.017 0.930 0.018 0.921 0.016 0.961 0.070 0.763 0.108 0.926 0.100 0.355 0.163 Rosenbrock 10 0.003 0.002 0.004 0.002 0.083 0.008 0.016 0.011 0.010 0.005 0.065 0.014 0.895 0.170 0.803 0.115 0.962 0.127 0.044 0.014 0.085 0.011 0.004 0.001 0.001 0.000 Styblinski-Tang 10 0.200 0.023 0.225 0.042 0.571 0.053 0.228 0.028 0.333 0.018 0.487 0.068 1.236 0.149 1.216 0.053 1.184 0.027 1.173 0.326 0.815 0.044 0.676 0.113 0.170 0.049 Robot Pushing 14 0.800 0.177 0.797 0.081 0.879 0.082 0.970 0.057 0.972 0.057 0.986 0.058 0.970 0.026 0.795 0.123 0.984 0.031 0.892 0.103 0.949 0.043 0.673 0.077 0.506 0.070 Ackley 20 0.668 0.127 0.261 0.133 0.211 0.050 0.219 0.030 0.309 0.078 0.607 0.090 0.924 0.019 0.931 0.016 0.910 0.007 0.959 0.085 0.980 0.002 0.912 0.221 0.636 0.111 Levy 20 0.078 0.032 0.078 0.078 0.117 0.092 0.186 0.157 0.114 0.060 0.207 0.052 0.924 0.108 0.859 0.093 1.151 0.105 0.473 0.078 0.743 0.042 0.219 0.086 0.093 0.045 Powell 20 0.097 0.068 0.006 0.002 0.035 0.023 0.082 0.020 0.077 0.008 0.118 0.030 0.757 0.147 0.690 0.195 0.842 0.134 0.086 0.040 0.446 0.142 0.020 0.007 0.011 0.006 Rastrigin 20 0.713 0.083 0.614 0.063 0.506 0.090 0.618 0.049 0.644 0.077 0.562 0.006 0.860 0.048 0.833 0.015 0.850 0.022 0.923 0.228 0.864 0.018 0.725 0.020 0.476 0.102 Rosenbrock 20 0.038 0.039 0.004 0.002 0.029 0.019 0.117 0.064 0.055 0.054 0.050 0.016 0.645 0.113 0.587 0.109 0.978 0.124 0.065 0.022 0.394 0.125 0.008 0.004 0.005 0.001 Styblinski-Tang 20 0.405 0.163 0.357 0.111 0.730 0.069 0.396 0.047 0.529 0.030 0.560 0.038 1.161 0.102 1.093 0.077 1.167 0.034 1.193 0.399 0.903 0.037 0.723 0.090 0.257 0.059 Ackley 50 0.897 0.052 0.859 0.128 0.159 0.010 0.402 0.248 0.465 0.246 0.538 0.098 0.932 0.036 0.957 0.024 0.849 0.024 0.863 0.032 0.986 0.001 0.360 0.115 0.868 0.009 Levy 50 0.039 0.032 0.033 0.041 0.043 0.012 0.020 0.007 0.048 0.015 0.239 0.080 0.667 0.072 0.582 0.135 0.883 0.224 0.099 0.019 0.873 0.009 0.109 0.026 0.227 0.053 Powell 50 0.022 0.010 0.016 0.006 0.015 0.008 0.025 0.026 0.036 0.035 0.076 0.031 0.470 0.158 0.533 0.048 0.575 0.194 0.046 0.024 0.880 0.034 0.017 0.006 0.041 0.015 Rastrigin 50 0.753 0.083 0.601 0.031 0.891 0.420 0.587 0.062 0.585 0.055 0.579 0.093 0.798 0.052 0.820 0.021 0.792 0.028 0.644 0.031 0.932 0.003 0.769 0.203 0.558 0.042 Rosenbrock 50 0.016 0.007 0.010 0.003 0.007 0.003 0.014 0.006 0.035 0.024 0.051 0.021 0.656 0.165 0.540 0.085 0.698 0.174 0.030 0.006 0.794 0.039 0.012 0.003 0.057 0.033 Styblinski-Tang 50 0.460 0.223 1.063 1.782 0.709 0.130 0.449 0.113 0.574 0.097 0.721 0.032 1.032 0.109 1.134 0.117 0.995 0.045 0.613 0.127 0.960 0.012 0.436 0.334 0.483 0.059 Rover trajectory 60 0.475 0.127 0.403 0.115 0.511 0.187 0.380 0.083 0.485 0.110 0.679 0.139 0.684 0.150 0.450 0.145 0.473 0.115 0.561 0.060 0.923 0.029 0.479 0.160 0.186 0.042 Ackley 100 0.684 0.402 0.815 0.360 0.136 0.022 0.554 0.291 0.476 0.283 0.295 0.082 0.952 0.027 0.908 0.057 0.883 0.039 0.672 0.129 0.997 0.001 0.799 0.159 0.904 0.010 Emb. Hartmann 6 100 0.089 0.057 0.039 0.028 0.179 0.142 0.140 0.093 0.100 0.051 0.194 0.128 0.636 0.095 0.843 0.026 0.717 0.197 0.562 0.324 0.882 0.019 0.118 0.125 0.068 0.036 Levy 100 0.089 0.105 0.043 0.034 0.044 0.014 0.039 0.012 0.173 0.085 0.061 0.031 0.629 0.051 0.735 0.134 0.624 0.127 0.131 0.111 0.980 0.021 0.086 0.037 0.300 0.020 Powell 100 0.117 0.020 0.038 0.029 0.008 0.002 0.010 0.003 0.034 0.040 0.011 0.004 0.482 0.066 0.549 0.089 0.477 0.067 0.031 0.012 1.019 0.069 0.021 0.009 0.112 0.017 Rastrigin 100 0.503 0.074 0.584 0.192 0.548 0.029 0.474 0.068 0.550 0.075 0.554 0.061 0.759 0.021 0.825 0.041 0.774 0.031 0.628 0.049 0.990 0.007 0.833 0.365 0.584 0.009 Rosenbrock 100 0.119 0.018 0.026 0.010 0.015 0.004 0.059 0.071 0.061 0.049 0.036 0.019 0.415 0.078 0.600 0.100 0.502 0.068 0.199 0.102 0.982 0.040 0.043 0.037 0.141 0.012 Styblinski-Tang 100 0.372 0.034 0.448 0.158 0.505 0.074 0.547 0.103 0.635 0.145 0.774 0.097 0.947 0.064 1.211 0.140 0.910 0.048 0.429 0.077 0.988 0.009 0.349 0.051 0.527 0.035

Mean 0.307 0.287 0.295 0.264 0.286 0.329 0.882 0.866 0.899 0.624 0.734 0.434 0.245 Median 0.173 0.170 0.179 0.189 0.173 0.239 0.924 0.908 0.917 0.672 0.873 0.360 0.145

D.6 BO curves for all experiments in Table 2 and Table A1

Figure A4: Experiments on the Shekel, Hartmann, Cosine and embedded Hartmann test functions with κ = 0.1 for BEEBO and q-UCB.

Figure A5: Experiments on the Shekel, Hartmann, Cosine and embedded Hartmann test functions with κ = 1.0 for BEEBO and q-UCB.

Figure A6: Experiments on the Shekel, Hartmann, Cosine and embedded Hartmann test functions with κ = 10.0 for BEEBO and q-UCB.

Figure A7: Experiments on the Ackley test function with κ = 0.1 for BEEBO and q UCB.

Figure A8: Experiments on the Ackley test function with κ = 1.0 for BEEBO and q UCB.

Figure A9: Experiments on the Ackley test function with κ = 10.0 for BEEBO and q UCB.

Figure A10: Experiments on the Levy test function with κ = 0.1 for BEEBO and q UCB.

Figure A11: Experiments on the Levy test function with κ = 1.0 for BEEBO and q UCB.

Figure A12: Experiments on the Levy test function with κ = 10.0 for BEEBO and q UCB.

Figure A13: Experiments on the Rastrigin test function with κ = 0.1 for BEEBO and q-UCB.

Figure A14: Experiments on the Rastrigin test function with κ = 1.0 for BEEBO and q-UCB.

Figure A15: Experiments on the Rastrigin test function with κ = 10.0 for BEEBO and q-UCB.

Figure A16: Experiments on the Rosenbrock test function with κ = 0.1 for BEEBO and q-UCB.

Figure A17: Experiments on the Rosenbrock test function with κ = 1.0 for BEEBO and q-UCB.

Figure A18: Experiments on the Rosenbrock test function with κ = 10.0 for BEEBO and q-UCB.

Figure A19: Experiments on the Powell test function with κ = 0.1 for BEEBO and q-UCB.

Figure A20: Experiments on the Powell test function with κ = 1.0 for BEEBO and q-UCB.

Figure A21: Experiments on the Powell test function with κ = 10.0 for BEEBO and q-UCB.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification: As claimed, we experimentally demonstrate a) the controllability of the acquisition strategy, b) competitive performance on 33 test problems compared to q-UCB, q EI, Thompson sampling, GIBBON, Tu RBO and Kriging Believer, and c) behaviour under heteroskedastic noise.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We address limitations in our discussion section, highlighting computational complexity constraints in exact GP inference as well as challenges under heteroskedastic noise.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: The paper does not make use of any theoretical results. All reported results are based on empirical experiments. All underlying assumptions are standard in research on BO with GPs. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: As described in the methods section, we use standard Bo Torch and GPy Torch utilities for all our experiments, and provide extended details on the technical implementation in the supplementary section. Our repository includes the full benchmarking setup with appropriate run scripts and instructions. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The repository includes the implementation of the proposed method as well as the benchmarking setup with alternative methods. No additional data is required for reproduction.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We follow GPy Torch and Bo Torch for all hyperparameters pertaining to GPs, and describe this accordingly. Our appendix includes additional details on method hyperparameters to ensure reproducibility.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We include full BO curves with standard deviations over five replicates for all quantitative experiments in the appendix. These detailed curves are referenced in the main text at the appropriate place.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We list the used hardware and total GPU hours in the supplement and provide example timings for experiment runtimes.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: The paper does not make use of human participants or datasets. To the best of our understanding, there are no potential harmful consequences and wider negative societal impact expected from the proposed method.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA]

Justification: The paper introduces a method for Bayesian optimization (BO). While BO has widespread applications in the sciences and engineering, there is no direct societal impact expected from this contribution. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper does not introduce any trained models or novel data. Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes] Justification: We credit the GPy Torch and Bo Torch packages that our codebase builds upon. The packages are used as dependencies, and as such are not included directly as assets. Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: The implementation of BEEBO constitutes the only asset, which follows Bo Torch APIs and has a README file demonstrating its application when working in Bo Torch. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: None of the above are included in this paper. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The presented paper does not involve any human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.