# triangulation_candidates_for_bayesian_optimization__7a391b6a.pdf

Triangulation candidates for Bayesian optimization

Robert B. Gramacy Department of Statistics

Virginia Tech Blacksburg, VA 24060, USA

Annie Sauer Department of Statistics

Virginia Tech Blacksburg, VA 24060, USA

anniees@vt.edu

Nathan Wycoff The Mc Court School s Massive Data Institute

Georgetown University Washington DC, 20057, USA nathan.wycoff@georgetown.edu

Bayesian optimization involves inner optimization over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these tricands and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.

1 Introduction

We address the continuous unconstrained optimization problem

x? = argmin

where the bounding box B is a hyperrectangle, often taken as [0, 1]d in coded inputs. The objective f : B ! R is a blackbox function, meaning that we can only learn about its behavior through expensive, often simulation-based, evaluation. Such problems are most challenging when f is highly non-convex, and thus contains multiple local minima. A tacit goal of a solver is to minimize the number of times that f is evaluated in search of a global solution.

The earliest papers on Bayesian optimization (BO) adapted statistical modeling and design principles to tackle this optimization problem [Moˇckus, 1975, Jones et al., 1998]. Applications on physics-based simulators f are provided by Pourmohamad and Lee [2021]; scenarios in machine learning are reviewed by Garnett [2022]. BO is common in studies of engagement and user experiences in online platforms [e.g., Letham and Bakshy, 2019], hyperparameter estimation for deep learning [e.g. Turner et al., 2021, Feurer et al., 2018] and materials design [e.g. Zhang et al., 2020b], to name a few.

To illustrate BO and introduce our contributions, consider f(x) = sin(x) and B = [ 1, 2 + 1] evaluated at six equally-spaced inputs x. Next ﬁt a surrogate ˆfn, to data (Xn, Yn), where yi = f(xi), for i = 1, . . . , n = 6. We privilege Gaussian process (GP) based-surrogates [e.g., Gramacy, 2020],

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

1.0 0.0 0.5 1.0 1.5

0.00 0.05 0.10 0.15 0.20 0.25

Expected Improvement Surface

Figure 1: Predictive surface (left) via means (solid black), 90% intervals (dashed-red), and sample paths (gray); EI (right) and resulting acquisition (green diamond).

but our methodology is agnostic to that choice so long as the predictive equations from ˆfn have similar features e.g., non-linear predictive mean µn(x) and higher predictive uncertainty/standard deviation σn(x) away from the training data sites.

The left panel of Figure 1 shows a ﬁtted ˆf6 via µ6 as solid black line with error bars µ6 2σ6 in dashed-red. Each of the one-hundred gray lines is a sample from the predictive distribution. This distribution interpolates the training data (black dots), a GP hallmark, yet our contributions are not limited to noise free settings. (We entertain a noisy f in Section 3.2.) The current best run f min

n = min{y1, . . . , yn} is near x = 5, but there is considerable predictive uncertainty about other parts of the input space B, relative to f min

n . Locations in [ 1, 0] [ [4, 6] [ [2 , 2 + 1] are promising.

BO acquisition criteria serve to operationalize this notion. We shall focus on two and note that several others are variations on similar themes. One is Thompson sampling [TS; Thompson, 1933], and involves drawing from the predictive distribution Y (x) ˆfn | Xn, Yn and choosing xn+1 = argminx2B Y (x). TS amounts to randomly selecting one of those gray lines and optimizing it in lieu of the expensive f, so the criterion is inherently stochastic. The other is expected improvement [EI; Jones et al., 1998]. Deﬁne improvement as I(x) = min{0, f min

n Y (x)} and take its expectation with respect to Y (x) . If Y (x) ˆfn is Gaussian, as with GPs, then this has closed form:

EI(x) = E{I(X)} =

I(x) d Y (x) = (f min

n µn(x)) Φ(zn(x)) + σn(x) φ(zn(x)), (2)

where zn(x) = (f min

n µn(x))/σn(x) and Φ/φ are the Gaussian cdf/pdf. For non-Gaussian ˆfn, one can always resort to Monte Carlo (MC) integration instead. The right panel of Figure 1 provides EI for ˆf6. Contrary to TS, the EI acquisition xn+1 = argmaxx2B EI(x) can be deterministic, if properly solved. The maximum EI acquisition x7 is shown as a green diamond on the left panel.

Observe that TS and EI involve inner-optimizations over a criterion, a task that may be even more challenging than the original problem (1). Each gray line is highly multi-modal, as is the EI surface. Both have about as many local optima as there are training data points n, whereas f has only two local minima in B. Speedy evaluation of µn(x) and σn(x) relative to f(x) saves us, but only partly. We still desire good xn+1 without exhaustive search or cumbersome subroutines.

Two strategies are common, sometimes separately, sometimes in tandem as a hybrid. The simplest option is to distribute N candidate points XN throughout the input space, evaluate the criterion on XN, and thereby replace a continuous search with a discrete one. In low input dimension a dense grid of candidates can be used effectively. In higher dimension one can populate XN with a space-ﬁlling design like a random Latin hypercube sample [LHS; Mckay et al., 1979] to manage the computational expense of evaluating the criterion exhaustively. A higher-powered approach is to locally apply a smooth, convex optimization library such as L-BFGS-B [Byrd et al., 2003]. Derivatives may be approximated by ﬁnite-differencing or autograd [Paszke et al., 2017], or may have a simple closed form depending on the criterion and nature of ˆfn. The former requires more computer work and some consideration of numerical stability; the latter more researcher/programmer effort when possible. (MC-based ˆfn or EI challenges both approaches.) Pure candidate-based search is most common with

TS because of its stochastic nature. Gradient-based continuous search is popular in simple GP/EI setups, but a multi-start scheme is essential to avoid inferior local solutions. This is where the hybrids come in: candidates seeding local solvers.

In this paper we contend that both strategies, random candidates and multi-start local optimization, can be replaced by (or hybridized with) more thoughtfully chosen XN. In the 1d setting of Figure 1 we could place candidates at the midway points between each of the existing n inputs and the boundary B, implementing a kind of multi-pronged bisection search [Burden and Faires, 1985, Section 2.1] and resulting in N = n + 1 = 7 candidates. The best of those XN by either criterion may not give a precise solution to the inner optimization, but it would be an effective one because EI and many of the random gray lines indicate a solution close to one of those midway points. Such locations might be much better than ones identiﬁed by a limited candidate or numerical local search.

This 1d example is overly simplistic. Going forward, we shall explicitly target two and higher dimensions. In Section 2 we scale-up the midway candidate idea, also suggested by Scott et al. [2011], to what we call tricands , based on Delaunay triangulation and the convex hull of Xn. We explore tricands features and limitations and suggest remedies with the BO application in mind. In Section 3 we demonstrate that tricands outperform both random LHS candidates and a multi-start gradient-based numerical inner optimization in the conventional setting where surrogate GP predictive equations are available in closed form. In Section 4 we consider two nonstationary surrogates requiring Markov chain Monte Carlo (MCMC), for which closed-form acquisition criteria are not readily available. Candidates are essential in this setting, and our tricands are better than random space-ﬁlling ones. Our discussion in Section 5 emphasizes tricands plug-n-play nature they can be inserted into any candidate-based scheme and suggests potential for extension.

2 Delaunay triangulation candidates

Many criteria for BO resemble Eq. (2), balancing exploitation (µn(x) below f min

n ) with exploration (large σn(x)). Most surrogates inﬂate predictive uncertainty (σn(x)) away from training data locations Xn. In the case of GPs, this is what produces the sausage shaped predictive intervals shown as red-dashed lines in Figure 1. Our main insight is that careful allocation of candidates between existing training data locations, where σn(x) is high, allows for BO acquisitions that do not necessitate cumbersome numerical optimization of posterior predictive quantities but still balance exploitation and exploration. A hard statistical optimization can be replaced with an easier geometric one.

0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0

design cands

triangles fringe

0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0

random selections

Figure 2: Interior and fringe candidates (both), and randomly sub-sampled candidates (right).

The idea is sketched graphically in Figure 2, whose details will emerge over the next three subsections. The existing design Xn is shown as open circles, and the selected candidates are closed red dots and the tips of dashed arrows. We call these interior (Section 2.1) and fringe (Section 2.2) candidates, respectively. Both are calculated based on Delaunay triangles and their convex hull, outlined by solid black lines in the ﬁgure, and explained next. We provide this illustration in 2d to ease visualization; however our Section 3 4 benchmarks include higher dimensional input spaces.

2.1 Interior candidates

A Delaunay triangulation of Xn is an angle-maximizing set of d line-segments connecting geographically nearby points, such that no point lies inside the circumcircle of those points. Such triangulations subdivide the interior of the convex hull of Xn, which is the the smallest (convex) set that contains all points. In 2d, lines depicting those subdivisions form triangles. In higher dimension they form tetrahedra, however one often abuses the nomenclature and still refers to triangles. For more details, see, e.g., Lee and Schachter [1980]. The solid lines of Figure 2 indicate a triangulation for random Xn (left) and an Xn derived after BO iterations (right), described in Section 3 .

There are fast algorithms for calculating Delaunay triangulations. In 2d, these have O(n log n) runtimes. Higher dimensional analysis is complicated by the number of triangles, which depends on the geometry of Xn, a topic we shall return to shortly. For R we use the geometry package on CRAN [Habel et al., 2019]; for Python we use Delaunay in scipy.spatial [Virtanen et al., 2020]. Both are wrappers around the C library Qhull, implementing quickhull [Barber et al., 1996]. Bates and Pronzato [2001] ﬁrst suggested Delaunay triangulation for BO. That early work only explored two input dimensions, possibly because they did not have convenient access to Qhull. They also did not entertain the BO-speciﬁc extensions we provide here, particularly in Sections 2.2 2.3.

Let the triangles be denoted by Tj, for j = 1, . . . , n T . Each Tj is a (d + 1) d matrix when Xn has d columns. Create n T new candidates Xn T where the jth candidate is located at the barycenter of Tj:

xj = Tj = 1 d + 1

Tj[i, ] or xjk = 1 d + 1

Tj[i, k], for k = 1, . . . , d.

The left expression is vectorized over the second, column dimension of Tj. The second is explicit about coordinates x>

j = ( xj1, . . . , xjd). Red dots in Figure 2 provide a visual. These locations will almost certainly not be the maximal points of σn(x) in the vicinity of Tj, but they will be close because xj is within Tj but far from its edges. We refer to these Xn T as interior candidates.

In 2d, Euler s formula gives that n T = 2n 2 h(Xn) where h(Xn) is the number of elements of Xn on its convex hull. In Figure 2, n = 10 and h(Xn) = 6 so n T = 12. When d 3, the number of faces of the tetrahedra can grow as ndd/2e depending on the nature of Xn. Figure 3 provides

2 3 4 5 6 7 8

input dimension

log_10 num cands

n = 10 n = 20 n = 50 n = 100 n = 1000 internal

2 3 4 5 6 7 8

3 2 1 0 1 2

input dimension

log_10 elapsed seconds

Figure 3: Tricands N (left) over input dimension d and design size n, and compute time (right) using a single core of an Intel i7-6900K CPU at 3.20GHz.

an empirical view of the number of candidates (left) and triangulation compute time (right), for varying d and n where Xn is distributed uniformly at random in B = [0, 1]d. The dashed lines in the left panel are log10 n T . Notice that when n is small, n T decreases as d increases. For ﬁxed d, n T steadily increases with n. In our BO context this is a good thing, because so too does the modality of acquisition criterion. With nonparametric surrogates like GPs, the complexity of the response surface can increase with training data size n. The search effort for the next acquisition should be commensurate with this complexity, an innate characteristic of (interior) tricands. The right panel of Figure 2, described later in Section 2.3, shows how this works after many iterations of EI acquisition.

2.2 Fringe candidates

Having candidates Xn T only in the interior of the convex hull could limit exploration if the complement of the volume of the hull and B is large. It could limit exploitation if the solution is on/near the boundary of B. One remedy is to force Xn to contain boundary points, such as the corners of the input space. This is not a bad approach in low dimension (e.g., in 2d there are only four corners), but could be prohibitive in higher dimension. Later in Section 3 we consider a d = 8 example, having 256 corners in B = [0, 1]8, which would nearly blow our entire budget of runs.

We instead prefer the fringe candidates pointed to by the dashed arrows in Figure 2. There is one of these for each facet (edge in 2d) of the convex hull, extending perpendicularly from middle of the facet half way to the boundary of B = [0, 1]d. The Qhull library furnishes both ingredients: d d facets Fj and normal d-vectors ~vj for j = 1, . . . , n F . Using these quantities, let Fj = 1

i=1 Fj[j, ] denote the coordinates of the middle of facet j. Then, the jth fringe candidate in Xn F is

~xj = Fj + 1

2 j~vj where j = min

I{~vj>0} Fj~vj

Above, min is picking out the nearest boundary to Fj. Division by two mirrors the bisection search analogy for interior candidates, but this could be a tuning parameter. Non-unit rectangular B is doable, but requires a more convoluted formula. More general B may present challenges.

Fringe and interior candidates may be combined: XN = [Xn T ; Xn F ], stacked row-wise to form an N d matrix with N = n T + n F . The number of fringe candidates n F is generally small compared to n T . This is shown empirically in the left panel of Figure 3, where N is indicated by the solid line, and n T by the dashed one. As n increases, the gap between N and n T , indicating n F , narrows.

2.3 Targeted sub-sampling

Having N grow exponentially in n may not suit all applications. Entertaining N 10,000 candidates when n = 100 and d = 6, referring to Figure 3, is cumbersome and potentially overkill. One way to limit N is random sub-sampling: simply calculate the full XN based on Xn and downsample, uniformly at random, a subset of size Nsub < N. Since our triangulation strategy was designed to focus on exploration (ﬁnding locally high σn(x)) via midway candidates, we ﬁnd it advantageous to guarantee retaining some of those candidates which are promising for exploitation (low µn(x)), still without any explicit numerical optimization only geometry.

One of the rows of Xn corresponds to f min

n , the best input found so far: xmin

n = xi s.t. i = argmini=1,...,n yi (or µn(xi) in the noisy case). Let T min

n = {Tj : xmin

n 2 Tj, j = 1, . . . , n T } denote the set of triangles containing xmin

n , and similarly let X min

n = { xj 2 XN : Tj 2 T min

n } denote the candidates associated with those triangles. Those are, in a geometric sense, adjacent to xmin

n . A fringe candidate may also be considered adjacent in an analogous way, however we place them in the complement XN \ X min

n . Now, rather than sub-sample uniformly from the full set XN, we partition sampling from points adjacent to xmin

n , i.e., from X min

n , and from points farther aﬁeld in XN \ X min

n . In so doing, we guarantee that our Nsub candidates cover potential for exploitation and exploration, respectively. We prefer a 10:90 split, with up to 10% of Nsub coming from candidates adjacent to xmin

n , fewer if |X min

n | < Nsub, and likewise 90% from its complement.

The right panel of Figure 2 provides an illustration after n = 30 runs optimizing the Goldstein Price function with EI under random initialization (details in Section 3.1). When n = 30 we have N 60, with the precise value depending on h(Xn). Here we consider Nsub = 30. Observe how randomly sub-sampled candidates XNsub, circling the original red candidates XN in green, concentrate near the global minimum (0.5, 0.25) because xmin

n is in the vicinity. Other sub-sampled candidates are spread out more widely. This ﬁgure illustrates how acquisitions, and thus candidates, gravitate toward promising regions for exploitation without neglecting areas of potential exploration. A ridge of local minima may be found in the banana-shaped region traced out by a concentration of Xn and XN.

2.4 Implementation and software

Our implementation, provided for Python and R in our git repository,1 is relatively tidy. For example, tricands.R therein contains just 71 lines of code, eleven of which are to support optional

1http://bitbucket.org/gramacylab/tricands

visualizations in 2d such as those in Figure 2. The heavy lifting is done by Qhull. When performing Delaunay triangulations, we provide option "Q12" to work around numerical instabilities that sometimes arise. When calculating convex hulls, option "n" returns normal vectors used in calculating fringe candidates (Section 2.2).

Defaults yield both fringe and interior candidates and ﬁx a maximum candidate size of max = Nsub = 100d, but these are user-adjustable. In experiments coming shortly, we deliberately limit Nsub even further so that a fairer comparison can be made to other continuous search and candidate-based methods. When N Nsub, all interior and fringe candidates are returned. The user may supplement these with additional Nsub N random candidates if desired. If Nsub < N, the execution ﬂow looks for a variable best, providing the index of xmin

n in Xn, making sure about 10% of tricands come from adjacent triangles. When best=NULL, tricands are sub-sampled at random.

3 Classical GP benchmarking

In this ﬁrst of two sections on benchmarking, we focus on BO via traditional GP surrogates. We follow a homework problem in Section 7.4 of Gramacy [2020], which piggy-backs off of GP, EI, and TS demonstrations earlier in the chapter. Software and other particulars are relegated to Appendix A.1. All of our examples are fully reproducible using the code provided in our git repository. We consider three methods for solving EI acquisitions: a continuous search of the criterion via L-BFGS-B with 5-multi-starts, LHS candidates, and tricands. For TS acquisitions, we similarly employ both LHS candidates and tricands. As non-BO (not surrogate-based) comparators, serving primarily as benchmarks, we entertain raw L-BFGS-B and Nelder Mead [Nelder and Mead, 1965].

Two examples are showcased here, with a third example relegated to Appendix A.2. Our synthetic fs, including those in Section 4, are described in more detail on the pages of the Virtual Library for Simulation Experiments [VLSE; Surjanovic and Bingham, 2013]. In all of our experiments we code inputs to [0, 1]d and summarize results for 100 random restarts where each surrogate is initialized with the same (unique to each random restart) starting design of size n0 = 12, except in Section 3.2 where we use n0 = 60. This design is taken uniformly at random following the advice of Zhang et al. [2021] who caution that small space-ﬁlling initial designs can spark pathological behavior in BO. We track best observed value (BOV) f min

n as a measure of progress which is summarized by median over n = 1, . . . , nend and by boxplots for particular n along the way.

3.1 Goldstein Price

The Goldstein Price function is a popular low-dimensional (2d) benchmark for BO [e.g., Picheny et al., 2012]. A total of nend = 50 acquisitions are entertained, and all candidate-based methods (tricands and LHS) are limited to ﬁfty candidates. This means max = Nsub = 50 for tricands, with fewer candidates when N < Nsub. The top row of Figure 4 summarizes results in three views. In the top-left panel, median progress in BOV is shown over increasing budgets of evaluations (n) as if each subsequent acquisition were the last. In the top-middle and right panels, boxplots capture the distribution of BOV at n = 30 and n = 50, respectively. In the top-left panel, methods based on a multi-start numerical local search use solid lines; those based on tricands are dashed; those based on LHS candidates are dotted. (Nelder Mead is an exception, being red-dashed.) In the boxplots, tricands use slightly heavier ink so that they stand out. Text printed at the top of the top-right panel indicates the average number of times the criterion, EI or TS, was evaluated in solving the inner optimization sub-problem(s), cumulative over all acquisitions. In the case of a numerically optimized EI (labeled EI ), these happen within iterations of L-BFGS-B search.

Beginning with medians over n in the top-left panel, observe that the dashed lines (those based on tricands) are uniformly superior to all other comparators. EI with tricands is slightly better than TS with tricands. Only at the very end of the run does the raw Nelder Mead alternative look competitive. At the 30th evaluation (top-middle) the boxplots indicate that there are some MC repetitions where BOV based on tricands are underperforming. However, more than half of the time these are the best two (i.e., both EI and TS with tricands) of all. TS with tricands is the winner in terms of best-case performance, whereas the EI version has slightly better worst-case behavior. Neither local optimizer (red) is competitive. Finally, in the top-right panel we see that the story is similar, except that now EI with tricands edges out its TS analog. Although median performance for Nelder Mead is competitive, more than half of the time its BOV at nend = 50 is among the two worst in the experiment.

0 10 20 30 40 50

3.0 2.5 2.0 1.5 1.0 0.5 0.0

Goldstein Price, no noise

n: blackbox evaluations

median best observed value

Nelder Mead BFGS

EI EI tri EI lhs TS tri TS lhs

3.0 2.5 2.0 1.5 1.0

30th evaluation

best observed value

final evaluation

# of candidates/criteria evals 9021 1690 1900 1690 1900

0 50 100 150 200 250 300

2.5 2.0 1.5 1.0 0.5

Assemble to Order, input dependent noise

n: blackbox evaluations

median (estimated) best observed value

EI EI tri EI hyb

2.6 2.4 2.2 2.0

150th evaluation

(estimated) best observed value

final evaluation

Figure 4: Goldstein Price (top) and ATO (bottom) BOV over 100 MC trials: median (left panel) and distribution at intermediate (middle) and ﬁnal (right) acquisitions. Sufﬁxes -tri or -lhs denote a candidate search. Absence of a sufﬁx (e.g., EI alone) indicates multi-start L-BFGS-B.

Perhaps the most striking result from the ﬁgure is that tricands-based EI ( EI-tri ) outperforms its multi-start numerical analog ( EI ) despite a factor of ﬁve fewer evaluations of the EI criterion (about 9000 compared to 1700). A more aggressive, gradient-based search misutilizes computational resources. Finding a precise solution ultimately furnishing a maximal local value in an inferior domain of attraction may come at the expense of ﬁnding an accurate approximation near a global solution. The same is true, but to a lesser extent, when comparing TS variations (5% reduction).

See Appendix A.2 for nearly identical results with the higher dimensional Hartmann 6 function.

3.2 Assemble to order (ATO)

The assemble-to-order (ATO) simulator [Hong and Nelson, 2006] deploys a queuing system virtualizing a manufacturer receiving orders for a variety of widgets, each with different demand for component parts. Target inventory levels for eight parts comprise the controllable inputs (d = 8). The output is proﬁt, a scalar distilling inventory and order fulﬁllment costs under random orders for parts, fulﬁllment and inventory replenishment times. ATO is coded in Matlab and exhibits input-dependent noise. Although it is reasonably fast (seconds per evaluation), relatively large n is required to separate signal from noise (n0 = 80 and nend = 300). Cubic computational bottlnecks for GP inference are in play here, despite using a thrifty heteroskedastic GP [Binois et al., 2018]. More details, including software, are provided in Appendix A.1. Since our Matlab licenses prevented us from running parallel instances, we had to limit our experiment somewhat to obtain results in a reasonable amount

of time. Therefore we only entertained three acquisition alternatives: random multi-start numerically optimized EI, tricands with the default Nsub = 100d = 800, and a hybrid using the best tricands candidate to initialize a local numerically optimized EI ( EI-hyb ).

The bottom panels of Figure 4 track BOV for these three alternatives in a view similar to the panels above. Note that the original ATO problem involves maximization, so we have negated the output and also scaled it so that its units were more similar to our other benchmark problems. Since the output is random, the BOVs are estimates. We used the surrogate ﬁt at nend to assign in-sample predictions via µnend(xi). Otherwise, the story is similar to our earlier synthetic results. Tricands progress is initially slower than numerically optimized EI, but eventually gives lower BOV values. The difference at nend is slight, but statistically signiﬁcant. A paired Wilcox test with alternative hypothesis that BOV for EI-tri is below EI has a p-value of 0.0037. The hybrid is even better, beating pure tricands and pure numerical optimization 85 and 88 out of 100 times, respectively.

4 Sampling-based surrogates

Tricands are most valuable when the inner-optimization problem cannot be solved by library-based numerical methods, even locally. This happens when the surrogate predictive surface is discontinuous and/or when inference requires MCMC. Our examples here involve response surfaces exhibiting nonstationarity, meaning that the input output dynamics evolve over the input space. This demands a more elaborate surrogate. We consider two. A treed Gaussian process [TGP; Gramacy and Lee, 2008] uses axis-aligned partitioning with GPs. Such divide-and-conquer excels when dynamics change abruptly across individual inputs creating distinct regimes. MCMC can average over the location of probable partition boundaries. We use the R package tgp on CRAN [Gramacy, 2007], following Section 4 of Gramacy and Taddy [2010] for candidate-based EI acquisition.

Deep Gaussian processes [DGPs; Damianou and Lawrence, 2013] warp inputs to accommodate a more subtle evolution of dynamics in the input space compared to the abrupt regime changes of TGP. Although variational inference is popular for DGPs [Salimbeni and Deisenroth, 2017], Sauer et al. [2020] argue that in active learning contexts, such as BO, full posterior integration leads to better uncertainty quantiﬁcation, and thus better acquisitions. Here we use the R package deepgp on CRAN [Sauer, 2021] which supports EI evaluation on candidates.

Unfortunately, neither tgp nor deepgp facilitate TS acquisition for BO. However, the tgp package furnishes a maximum a posteriori sample which can be used as the basis of a local search [Gramacy and Taddy, 2010, Section 4]. Here we show that tricands provides better candidates for both pure candidate EI search and this hybrid scheme.

4.1 Abrupt changes

The Gramacy & Lee (G&L) function beneﬁts from a model supporting hard breaks even though its dynamics evolve smoothly. The 2d input domain (coded to [0, 1]2) is mostly ﬂat except in one quadrant where a local maximum is twinned with a (global) minimum. The domain of attraction of that global minimum covers only about 10% of the input space. It is easily missed by random initial designs and candidate sets. TGP is able to isolate the interesting quadrant with just two axis aligned partitions, thus recognizing that sampling effort should be concentrated there.

The top row of Figure 5 shows results in the same three views as earlier. The number of candidates is limited to twenty, otherwise the setup is unchanged. First ignore the red lines and boxplots (those labeled hyb- ) and focus on the black (TGP) and green (DGP). Notice that in both cases the tricands-based comparators, dashed and/or bolded, outperform their LHS-based analogues. In terms of medians over n (top-left panel), that dominance is uniform. In terms of boxplots, the disparity is stark with the exception of a few outlying DGP-based BOV values with tricands. TGP seems to edge out DGP (with tricands) which we attribute to the abrupt change between the interesting quadrant and the rest of the input space. Now focus on the red hybresults. These are the ones where TGP candidate-based search is ﬁnished with a gradient-based EI search on the maximum a posteriori model. Again, tricands wins. To supplement the visuals, we report that the tricands hybrid BOV value was lowest in 97/100 reps (top-right).

A higher dimensional example using the Michaelwicz function is provided in Appendix A.3.

0 10 20 30 40 50

0.4 0.3 0.2 0.1 0.0

Gramacy & Lee, no noise

n: blackbox evaluations

median best observed value

TGP lhs TGP tri hyb lhs hyb tri DGP lhs DGP tri

0.42 0.40 0.38 0.36 0.34 0.32 0.30

30th evaluation

best observed value

final evaluation

0.05 0.10 0.15 0.20

Levy 5d, no noise

n: blackbox evaluations

median best observed value

TGP lhs TGP tri hyb lhs hyb tri DGP lhs DGP tri

0.00 0.05 0.10 0.15

50th evaluation

best observed value

final evaluation

Figure 5: Gramacy & Lee (top) and 5d Levy (bottom) with MCMC-based comparators.

4.2 Smooth changes

The Levy function is deﬁned in arbitrary dimension. It is rufﬂed with many peaks and valleys. Here we consider d = 5 for variety; see the bottom row of Figure 5. Commensurate with our other examples, we consider two hundred candidates and nend = 75.

Both candidate DGP variations outperform their TGP counterparts. Slowly/smoothly varying dynamics favor warping inputs as opposed to hard partitioning. This is evident in both median and full distribution (boxplot) views. In the median view (bottom-left), notice that the dashed lines are uniformly under the solid ones of the same color. Tricands are providing better median performance over all n. Although the preeminence of tricands is apparent visually, we report that it provided better BOV in 81/100 MC restarts among the best (DGP) comparators. When tricands are involved, the hybrid candidate/numerical option is not discernibly better than the pure candidate alternative.

5 Discussion

We offer a novel take on space-ﬁlling candidates for acquisition in BO. The idea is to ﬁll the spaces in-between previous acquisitions. This is motivated by an analogy to bisection search, but also by the nature of GP predictive surfaces which often serve as surrogates in BO. GP surrogates have organically inﬂated uncertainty between training data sites, which makes those spots attractive for BO acquisition. This notion is extended to the space between the convex hull of existing training data and the boundary B of the study region. In an array of benchmark exercises we have demonstrated that these tricands lead to superior performance in BO compared to both higher-powered gradient-based

acquisition schemes and simpler space-ﬁlling candidates. Tricands main attraction is that they mimic the behavior of higher-powered searches with the implementation simplicity of candidates. That simplicity means that tricands may be deployed where the higher-powered alternatives cannot, such as when the surrogate is not continuous or requires MCMC.

Additional discussion on high input dimension and other extensions is provided in Appendix B.

Acknowledgments and Disclosure of Funding

We are grateful for funding from DOE LAB 17-1697 via subaward from Argonne National Laboratory for Sci DAC/DOE Ofﬁce of Science ASCR and High Energy Physics, and for comments from two teams of anonymous referees which greatly improved the paper.

Dario Azzimonti, David Ginsbourger, Clément Chevalier, Julien Bect, and Yann Richet. Adaptive design of

experiments for conservative estimation of excursion sets. Technometrics, pages 1 14, 2020.

Evan Baker, Pierre Barbillon, Arindam Fadikar, Robert B Gramacy, Radu Herbei, David Higdon, Jiangeng

Huang, Leah R Johnson, Pulong Ma, Anirban Mondal, et al. Analyzing stochastic computer models: A review with opportunities. ar Xiv preprint ar Xiv:2002.01321, 2020.

C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM

Trans. Math. Softw., 22(4):469 483, December 1996. ISSN 0098-3500. doi: 10.1145/235815.235821. URL

https://doi.org/10.1145/235815.235821.

R. Bates and L. Pronzato. Emulator-based global optimisation using lattices and delaunay tesselation. In

Sensitivity Analysis of Model Output, pages 189 192, Madrid (Espagne), 2001.

Julien Bect, David Ginsbourger, Ling Li, Victor Picheny, and Emmanuel Vazquez. Sequential design of computer

experiments for the estimation of a probability of failure. Statistics and Computing, 22(3):773 793, 2012.

Henrik Bengtsson. R.matlab: Read and Write MAT Files and Call MATLAB from Within R, 2018. URL

https://CRAN.R-project.org/package=R.matlab. R package version 3.6.2.

James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization.

Advances in neural information processing systems, 24, 2011.

M Binois, RB Gramacy, and M Ludkovski. Practical heteroscedastic Gaussian process modeling for large

simulation experiments. Journal of Computational and Graphical Statistics, 27(4):808 821, 2018. doi: 10.1080/10618600.2018.1458625. URL https://doi.org/10.1080/10618600.2018.1458625.

M Binois, J Huang, RB Gramacy, and M Ludkovski. Replication or exploration? Sequential design for stochastic

simulation experiments. Technometrics, 27(4):808 821, 2019. doi: 10.1080/00401706.2018.1469433. URL https://doi.org/10.1080/00401706.2018.1469433.

Mickael Binois and Nathan Wycoff. A survey on high-dimensional Gaussian process modeling with application

to bayesian optimization, 2021.

Mickaël Binois and Robert B. Gramacy. hetgp: Heteroskedastic Gaussian process modeling and sequential

design in r. Journal of Statistical Software, 98(13):1 44, 2021. doi: 10.18637/jss.v098.i13. URL https: //www.jstatsoft.org/index.php/jss/article/view/v098i13.

Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001.

AD Bull. Convergence rates of efﬁcient global optimization algorithms. Journal of Machine Learning Research,

12(Oct):2879 2904, 2011.

R.L. Burden and J.D. Faires. Numerical Analysis. PWS Publishers, 1985.

Richard Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained

optimization. SIAM Journal on Scientiﬁc Computing, 16, 02 2003. doi: 10.1137/0916069.

Rob Carnell. lhs: Latin Hypercube Samples, 2018. URL https://CRAN.R-project.org/package=lhs. R

package version 0.16.

Clément Chevalier and David Ginsbourger. Fast computation of the multi-points expected improvement with

applications in batch selection. In International Conference on Learning and Intelligent Optimization, pages 59 69. Springer, 2013.

Clément Chevalier, Julien Bect, David Ginsbourger, Emmanuel Vazquez, Victor Picheny, and Yann Richet. Fast

parallel kriging-based stepwise uncertainty reduction with application to the identiﬁcation of an excursion set. Technometrics, 56(4):455 465, 2014.

D Austin Cole, Robert B Gramacy, James E Warner, Geoffrey F Bomarito, Patrick E Leser, and William P Leser.

Entropy-based adaptive design for contour ﬁnding and estimating reliability. ar Xiv preprint ar Xiv:2105.11357, 2021.

Andreas Damianou and Neil D Lawrence. Deep Gaussian processes. In Artiﬁcial intelligence and statistics,

pages 207 215. PMLR, 2013.

Samuel Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Multi-objective Bayesian optimiza-

tion over high-dimensional search spaces, 2021.

David Eriksson, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek. Scalable global

optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems, pages 5497 5508, 2019.

Matthias Feurer, Benjamin Letham, Frank Hutter, and Eytan Bakshy. Practical transfer learning for bayesian

optimization. ar Xiv preprint ar Xiv:1802.02219, 2018.

Peter I Frazier, Warren B Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information

collection. SIAM Journal on Control and Optimization, 47(5):2410 2439, 2008.

Roman Garnett. Bayesian Optimization. Cambridge University Press, 2022. in preparation.

David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. A multi-points criterion for deterministic parallel

global optimization based on kriging. In NCP07, 2007.

J Gonzalez, M Osborne, and N Lawrence. GLASSES: Relieving the myopia of Bayesian optimisation. In

Proceedings of the 19th International Conference on Artiﬁcial Intelligence and Statistics, pages 790 799, 2016.

Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local

penalization. In Artiﬁcial intelligence and statistics, pages 648 657. PMLR, 2016.

RB Gramacy. tgp: an R package for Bayesian nonstationary, semiparametric nonlinear regression and design by

treed Gaussian process models. Journal of Statistical Software, 19(9):6, 2007.

RB Gramacy. la GP: large-scale spatial modeling via local approximate Gaussian processes in R. Journal of

Statistical Software, 72(1):1 46, 2016.

RB Gramacy and Herbert KH Lee. Optimization under unknown constraints. In Bayesian Statistics, volume 9.

Oxford University Press, 2011.

RB Gramacy and HKH Lee. Bayesian treed Gaussian process models with an application to computer modeling.

Journal of the American Statistical Association, 103(483):1119 1130, 2008.

RB Gramacy and F Sun. la GP: Local Approximate Gaussian Process Regression, 2018. URL http://bobby.

gramacy.com/r_packages/la GP. R package version 1.5-3.

RB Gramacy and MA Taddy. Categorical inputs, sensitivity analysis, optimization and importance tempering

with tgp version 2, an R package for treed Gaussian process models. Journal of Statistical Software, 33(6): 1 48, 2010.

Robert B. Gramacy. Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences.

Chapman Hall/CRC, Boca Raton, Florida, 2020. http://bobby.gramacy.com/surrogates/.

Kai Habel, Raoul Grasman, Robert B. Gramacy, Pavlo Mozharovskyi, and David C. Sterratt. geometry: Mesh

Generation and Surface Tessellation, 2019. URL https://CRAN.R-project.org/package=geometry. R package version 0.4.5.

L.J. Hong and B.L. Nelson. Discrete optimization via simulation using compass. Operations Research, 54(1):

115 129, 2006.

Donald R Jones, Matthias Schonlau, and William J Welch. Efﬁcient global optimization of expensive black-box

functions. Journal of Global optimization, 13(4):455 492, 1998.

Remi Lam, Karen Willcox, and David H Wolpert. Bayesian optimization with a ﬁnite budget: An approximate

dynamic programming approach. Advances in Neural Information Processing Systems, 29:883 891, 2016.

ER Leatherman, TJ Santner, and AM Dean. Computer experiment designs for accurate prediction. Statistics and

Computing, pages 1 13, 2017.

Der-Tsai Lee and Bruce J Schachter. Two algorithms for constructing a delaunay triangulation. International

Journal of Computer & Information Sciences, 9(3):219 242, 1980.

Benjamin Letham and Eytan Bakshy. Bayesian optimization for policy search via online-ofﬂine experimentation.

J. Mach. Learn. Res., 20:145 1, 2019.

Alexandre Marques, Remi Lam, and Karen Willcox. Contour location via entropy reduction leveraging multiple

information sources. In Advances in Neural Information Processing Systems, pages 5217 5227, 2018.

D Mckay, Richard Beckman, and William Conover. A comparison of three methods for selecting vales of input

variables in the analysis of output from a computer code. Technometrics, 21:239 245, 05 1979.

Jonas Moˇckus. On bayesian methods for seeking the extremum. In Optimization techniques IFIP technical

conference, pages 400 404. Springer, 1975.

JA Nelder and R Mead. A simplex method for function minimization. The Computer Journal, 7(4):308 313,

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin,

Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. Neur IPS Autodiff Workshop, 2017.

V Picheny, T Wagner, and D Ginsbourger. A benchmark of kriging-based inﬁll criteria for noisy optimization.

Structural and Multidisciplinary Optimization, pages 1 20, 2012.

Tony Pourmohamad and Herbert KH Lee. Bayesian Optimization with Application to Computer Experiments.

Springer, New York, NY, 2021.

Pritam Ranjan, Derek Bingham, and George Michailidis. Sequential experiment design for contour estimation

from complex computer codes. Technometrics, 50(4):527 541, 2008.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes.

ar Xiv preprint ar Xiv:1705.08933, 2017.

Annie Sauer. deepgp: Sequential Design for Deep Gaussian Processes using MCMC, 2021. URL https:

//CRAN.R-project.org/package=deepgp. R package version 0.3.0.

Annie Sauer, Robert B Gramacy, and David Higdon. Active learning for deep Gaussian process surrogates.

ar Xiv preprint ar Xiv:2012.08015, 2020.

Warren Scott, Peter Frazier, and Warren Powell. The correlated knowledge gradient for simulation optimization

of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996 1026, 2011.

Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the

bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009.

Peter Su and Robert L Scot Drysdale. A comparison of sequential delaunay triangulation algorithms. Computa-

tional Geometry, 7(5-6):361 385, 1997.

S Surjanovic and D Bingham. Virtual library of simulation experiments: test functions and datasets. http:

//www.sfu.ca/~ssurjano, 2013.

MA Taddy, HKH Lee, GA Gray, and JD Grifﬁn. Bayesian guided pattern search for robust local optimization.

Technometrics, 51(4):389 401, 2009.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence

of two samples. Biometrika, 25(3/4):285 294, 1933.

Ryan Turner, David Eriksson, Michael Mc Court, Juha Kiili, Eero Laaksonen, Zhen Xu, and Isabelle Guyon.

Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. ar Xiv preprint ar Xiv:2104.10201, 2021.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni

Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientiﬁc Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/s41592-019-0686-2.

Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization

using monte carlo tree search. Advances in Neural Information Processing Systems, 33:19511 19522, 2020.

Boya Zhang, Robert B Gramacy, Leah Johnson, Kenneth A Rose, and Eric Smith. Batch-sequential design and

heteroskedastic surrogate modeling for delta smelt conservation. ar Xiv preprint ar Xiv:2010.06515, 2020a.

Boya Zhang, D Austin Cole, and Robert B Gramacy. Distance-distributed design for Gaussian process surrogates.

Technometrics, 63(1):40 52, 2021.

Yichi Zhang, Daniel W Apley, and Wei Chen. Bayesian optimization for materials design with mixed quantitative

and qualitative variables. Scientiﬁc reports, 10(1):1 13, 2020b.