# triangulation_candidates_for_bayesian_optimization__7a391b6a.pdf Triangulation candidates for Bayesian optimization Robert B. Gramacy Department of Statistics Virginia Tech Blacksburg, VA 24060, USA Annie Sauer Department of Statistics Virginia Tech Blacksburg, VA 24060, USA anniees@vt.edu Nathan Wycoff The Mc Court School s Massive Data Institute Georgetown University Washington DC, 20057, USA nathan.wycoff@georgetown.edu Bayesian optimization involves inner optimization over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these tricands and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments. 1 Introduction We address the continuous unconstrained optimization problem x? = argmin where the bounding box B is a hyperrectangle, often taken as [0, 1]d in coded inputs. The objective f : B ! R is a blackbox function, meaning that we can only learn about its behavior through expensive, often simulation-based, evaluation. Such problems are most challenging when f is highly non-convex, and thus contains multiple local minima. A tacit goal of a solver is to minimize the number of times that f is evaluated in search of a global solution. The earliest papers on Bayesian optimization (BO) adapted statistical modeling and design principles to tackle this optimization problem [Moˇckus, 1975, Jones et al., 1998]. Applications on physics-based simulators f are provided by Pourmohamad and Lee [2021]; scenarios in machine learning are reviewed by Garnett [2022]. BO is common in studies of engagement and user experiences in online platforms [e.g., Letham and Bakshy, 2019], hyperparameter estimation for deep learning [e.g. Turner et al., 2021, Feurer et al., 2018] and materials design [e.g. Zhang et al., 2020b], to name a few. To illustrate BO and introduce our contributions, consider f(x) = sin(x) and B = [ 1, 2 + 1] evaluated at six equally-spaced inputs x. Next fit a surrogate ˆfn, to data (Xn, Yn), where yi = f(xi), for i = 1, . . . , n = 6. We privilege Gaussian process (GP) based-surrogates [e.g., Gramacy, 2020], 36th Conference on Neural Information Processing Systems (Neur IPS 2022). 1.0 0.0 0.5 1.0 1.5 0.00 0.05 0.10 0.15 0.20 0.25 Expected Improvement Surface Figure 1: Predictive surface (left) via means (solid black), 90% intervals (dashed-red), and sample paths (gray); EI (right) and resulting acquisition (green diamond). but our methodology is agnostic to that choice so long as the predictive equations from ˆfn have similar features e.g., non-linear predictive mean µn(x) and higher predictive uncertainty/standard deviation σn(x) away from the training data sites. The left panel of Figure 1 shows a fitted ˆf6 via µ6 as solid black line with error bars µ6 2σ6 in dashed-red. Each of the one-hundred gray lines is a sample from the predictive distribution. This distribution interpolates the training data (black dots), a GP hallmark, yet our contributions are not limited to noise free settings. (We entertain a noisy f in Section 3.2.) The current best run f min n = min{y1, . . . , yn} is near x = 5, but there is considerable predictive uncertainty about other parts of the input space B, relative to f min n . Locations in [ 1, 0] [ [4, 6] [ [2 , 2 + 1] are promising. BO acquisition criteria serve to operationalize this notion. We shall focus on two and note that several others are variations on similar themes. One is Thompson sampling [TS; Thompson, 1933], and involves drawing from the predictive distribution Y (x) ˆfn | Xn, Yn and choosing xn+1 = argminx2B Y (x). TS amounts to randomly selecting one of those gray lines and optimizing it in lieu of the expensive f, so the criterion is inherently stochastic. The other is expected improvement [EI; Jones et al., 1998]. Define improvement as I(x) = min{0, f min n Y (x)} and take its expectation with respect to Y (x) . If Y (x) ˆfn is Gaussian, as with GPs, then this has closed form: EI(x) = E{I(X)} = I(x) d Y (x) = (f min n µn(x)) Φ(zn(x)) + σn(x) φ(zn(x)), (2) where zn(x) = (f min n µn(x))/σn(x) and Φ/φ are the Gaussian cdf/pdf. For non-Gaussian ˆfn, one can always resort to Monte Carlo (MC) integration instead. The right panel of Figure 1 provides EI for ˆf6. Contrary to TS, the EI acquisition xn+1 = argmaxx2B EI(x) can be deterministic, if properly solved. The maximum EI acquisition x7 is shown as a green diamond on the left panel. Observe that TS and EI involve inner-optimizations over a criterion, a task that may be even more challenging than the original problem (1). Each gray line is highly multi-modal, as is the EI surface. Both have about as many local optima as there are training data points n, whereas f has only two local minima in B. Speedy evaluation of µn(x) and σn(x) relative to f(x) saves us, but only partly. We still desire good xn+1 without exhaustive search or cumbersome subroutines. Two strategies are common, sometimes separately, sometimes in tandem as a hybrid. The simplest option is to distribute N candidate points XN throughout the input space, evaluate the criterion on XN, and thereby replace a continuous search with a discrete one. In low input dimension a dense grid of candidates can be used effectively. In higher dimension one can populate XN with a space-filling design like a random Latin hypercube sample [LHS; Mckay et al., 1979] to manage the computational expense of evaluating the criterion exhaustively. A higher-powered approach is to locally apply a smooth, convex optimization library such as L-BFGS-B [Byrd et al., 2003]. Derivatives may be approximated by finite-differencing or autograd [Paszke et al., 2017], or may have a simple closed form depending on the criterion and nature of ˆfn. The former requires more computer work and some consideration of numerical stability; the latter more researcher/programmer effort when possible. (MC-based ˆfn or EI challenges both approaches.) Pure candidate-based search is most common with TS because of its stochastic nature. Gradient-based continuous search is popular in simple GP/EI setups, but a multi-start scheme is essential to avoid inferior local solutions. This is where the hybrids come in: candidates seeding local solvers. In this paper we contend that both strategies, random candidates and multi-start local optimization, can be replaced by (or hybridized with) more thoughtfully chosen XN. In the 1d setting of Figure 1 we could place candidates at the midway points between each of the existing n inputs and the boundary B, implementing a kind of multi-pronged bisection search [Burden and Faires, 1985, Section 2.1] and resulting in N = n + 1 = 7 candidates. The best of those XN by either criterion may not give a precise solution to the inner optimization, but it would be an effective one because EI and many of the random gray lines indicate a solution close to one of those midway points. Such locations might be much better than ones identified by a limited candidate or numerical local search. This 1d example is overly simplistic. Going forward, we shall explicitly target two and higher dimensions. In Section 2 we scale-up the midway candidate idea, also suggested by Scott et al. [2011], to what we call tricands , based on Delaunay triangulation and the convex hull of Xn. We explore tricands features and limitations and suggest remedies with the BO application in mind. In Section 3 we demonstrate that tricands outperform both random LHS candidates and a multi-start gradient-based numerical inner optimization in the conventional setting where surrogate GP predictive equations are available in closed form. In Section 4 we consider two nonstationary surrogates requiring Markov chain Monte Carlo (MCMC), for which closed-form acquisition criteria are not readily available. Candidates are essential in this setting, and our tricands are better than random space-filling ones. Our discussion in Section 5 emphasizes tricands plug-n-play nature they can be inserted into any candidate-based scheme and suggests potential for extension. 2 Delaunay triangulation candidates Many criteria for BO resemble Eq. (2), balancing exploitation (µn(x) below f min n ) with exploration (large σn(x)). Most surrogates inflate predictive uncertainty (σn(x)) away from training data locations Xn. In the case of GPs, this is what produces the sausage shaped predictive intervals shown as red-dashed lines in Figure 1. Our main insight is that careful allocation of candidates between existing training data locations, where σn(x) is high, allows for BO acquisitions that do not necessitate cumbersome numerical optimization of posterior predictive quantities but still balance exploitation and exploration. A hard statistical optimization can be replaced with an easier geometric one. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 design cands triangles fringe 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 random selections Figure 2: Interior and fringe candidates (both), and randomly sub-sampled candidates (right). The idea is sketched graphically in Figure 2, whose details will emerge over the next three subsections. The existing design Xn is shown as open circles, and the selected candidates are closed red dots and the tips of dashed arrows. We call these interior (Section 2.1) and fringe (Section 2.2) candidates, respectively. Both are calculated based on Delaunay triangles and their convex hull, outlined by solid black lines in the figure, and explained next. We provide this illustration in 2d to ease visualization; however our Section 3 4 benchmarks include higher dimensional input spaces. 2.1 Interior candidates A Delaunay triangulation of Xn is an angle-maximizing set of d line-segments connecting geographically nearby points, such that no point lies inside the circumcircle of those points. Such triangulations subdivide the interior of the convex hull of Xn, which is the the smallest (convex) set that contains all points. In 2d, lines depicting those subdivisions form triangles. In higher dimension they form tetrahedra, however one often abuses the nomenclature and still refers to triangles. For more details, see, e.g., Lee and Schachter [1980]. The solid lines of Figure 2 indicate a triangulation for random Xn (left) and an Xn derived after BO iterations (right), described in Section 3 . There are fast algorithms for calculating Delaunay triangulations. In 2d, these have O(n log n) runtimes. Higher dimensional analysis is complicated by the number of triangles, which depends on the geometry of Xn, a topic we shall return to shortly. For R we use the geometry package on CRAN [Habel et al., 2019]; for Python we use Delaunay in scipy.spatial [Virtanen et al., 2020]. Both are wrappers around the C library Qhull, implementing quickhull [Barber et al., 1996]. Bates and Pronzato [2001] first suggested Delaunay triangulation for BO. That early work only explored two input dimensions, possibly because they did not have convenient access to Qhull. They also did not entertain the BO-specific extensions we provide here, particularly in Sections 2.2 2.3. Let the triangles be denoted by Tj, for j = 1, . . . , n T . Each Tj is a (d + 1) d matrix when Xn has d columns. Create n T new candidates Xn T where the jth candidate is located at the barycenter of Tj: xj = Tj = 1 d + 1 Tj[i, ] or xjk = 1 d + 1 Tj[i, k], for k = 1, . . . , d. The left expression is vectorized over the second, column dimension of Tj. The second is explicit about coordinates x> j = ( xj1, . . . , xjd). Red dots in Figure 2 provide a visual. These locations will almost certainly not be the maximal points of σn(x) in the vicinity of Tj, but they will be close because xj is within Tj but far from its edges. We refer to these Xn T as interior candidates. In 2d, Euler s formula gives that n T = 2n 2 h(Xn) where h(Xn) is the number of elements of Xn on its convex hull. In Figure 2, n = 10 and h(Xn) = 6 so n T = 12. When d 3, the number of faces of the tetrahedra can grow as ndd/2e depending on the nature of Xn. Figure 3 provides 2 3 4 5 6 7 8 input dimension log_10 num cands n = 10 n = 20 n = 50 n = 100 n = 1000 internal 2 3 4 5 6 7 8 3 2 1 0 1 2 input dimension log_10 elapsed seconds Figure 3: Tricands N (left) over input dimension d and design size n, and compute time (right) using a single core of an Intel i7-6900K CPU at 3.20GHz. an empirical view of the number of candidates (left) and triangulation compute time (right), for varying d and n where Xn is distributed uniformly at random in B = [0, 1]d. The dashed lines in the left panel are log10 n T . Notice that when n is small, n T decreases as d increases. For fixed d, n T steadily increases with n. In our BO context this is a good thing, because so too does the modality of acquisition criterion. With nonparametric surrogates like GPs, the complexity of the response surface can increase with training data size n. The search effort for the next acquisition should be commensurate with this complexity, an innate characteristic of (interior) tricands. The right panel of Figure 2, described later in Section 2.3, shows how this works after many iterations of EI acquisition. 2.2 Fringe candidates Having candidates Xn T only in the interior of the convex hull could limit exploration if the complement of the volume of the hull and B is large. It could limit exploitation if the solution is on/near the boundary of B. One remedy is to force Xn to contain boundary points, such as the corners of the input space. This is not a bad approach in low dimension (e.g., in 2d there are only four corners), but could be prohibitive in higher dimension. Later in Section 3 we consider a d = 8 example, having 256 corners in B = [0, 1]8, which would nearly blow our entire budget of runs. We instead prefer the fringe candidates pointed to by the dashed arrows in Figure 2. There is one of these for each facet (edge in 2d) of the convex hull, extending perpendicularly from middle of the facet half way to the boundary of B = [0, 1]d. The Qhull library furnishes both ingredients: d d facets Fj and normal d-vectors ~vj for j = 1, . . . , n F . Using these quantities, let Fj = 1 i=1 Fj[j, ] denote the coordinates of the middle of facet j. Then, the jth fringe candidate in Xn F is ~xj = Fj + 1 2 j~vj where j = min I{~vj>0} Fj~vj Above, min is picking out the nearest boundary to Fj. Division by two mirrors the bisection search analogy for interior candidates, but this could be a tuning parameter. Non-unit rectangular B is doable, but requires a more convoluted formula. More general B may present challenges. Fringe and interior candidates may be combined: XN = [Xn T ; Xn F ], stacked row-wise to form an N d matrix with N = n T + n F . The number of fringe candidates n F is generally small compared to n T . This is shown empirically in the left panel of Figure 3, where N is indicated by the solid line, and n T by the dashed one. As n increases, the gap between N and n T , indicating n F , narrows. 2.3 Targeted sub-sampling Having N grow exponentially in n may not suit all applications. Entertaining N 10,000 candidates when n = 100 and d = 6, referring to Figure 3, is cumbersome and potentially overkill. One way to limit N is random sub-sampling: simply calculate the full XN based on Xn and downsample, uniformly at random, a subset of size Nsub < N. Since our triangulation strategy was designed to focus on exploration (finding locally high σn(x)) via midway candidates, we find it advantageous to guarantee retaining some of those candidates which are promising for exploitation (low µn(x)), still without any explicit numerical optimization only geometry. One of the rows of Xn corresponds to f min n , the best input found so far: xmin n = xi s.t. i = argmini=1,...,n yi (or µn(xi) in the noisy case). Let T min n = {Tj : xmin n 2 Tj, j = 1, . . . , n T } denote the set of triangles containing xmin n , and similarly let X min n = { xj 2 XN : Tj 2 T min n } denote the candidates associated with those triangles. Those are, in a geometric sense, adjacent to xmin n . A fringe candidate may also be considered adjacent in an analogous way, however we place them in the complement XN \ X min n . Now, rather than sub-sample uniformly from the full set XN, we partition sampling from points adjacent to xmin n , i.e., from X min n , and from points farther afield in XN \ X min n . In so doing, we guarantee that our Nsub candidates cover potential for exploitation and exploration, respectively. We prefer a 10:90 split, with up to 10% of Nsub coming from candidates adjacent to xmin n , fewer if |X min n | < Nsub, and likewise 90% from its complement. The right panel of Figure 2 provides an illustration after n = 30 runs optimizing the Goldstein Price function with EI under random initialization (details in Section 3.1). When n = 30 we have N 60, with the precise value depending on h(Xn). Here we consider Nsub = 30. Observe how randomly sub-sampled candidates XNsub, circling the original red candidates XN in green, concentrate near the global minimum (0.5, 0.25) because xmin n is in the vicinity. Other sub-sampled candidates are spread out more widely. This figure illustrates how acquisitions, and thus candidates, gravitate toward promising regions for exploitation without neglecting areas of potential exploration. A ridge of local minima may be found in the banana-shaped region traced out by a concentration of Xn and XN. 2.4 Implementation and software Our implementation, provided for Python and R in our git repository,1 is relatively tidy. For example, tricands.R therein contains just 71 lines of code, eleven of which are to support optional 1http://bitbucket.org/gramacylab/tricands visualizations in 2d such as those in Figure 2. The heavy lifting is done by Qhull. When performing Delaunay triangulations, we provide option "Q12" to work around numerical instabilities that sometimes arise. When calculating convex hulls, option "n" returns normal vectors used in calculating fringe candidates (Section 2.2). Defaults yield both fringe and interior candidates and fix a maximum candidate size of max = Nsub = 100d, but these are user-adjustable. In experiments coming shortly, we deliberately limit Nsub even further so that a fairer comparison can be made to other continuous search and candidate-based methods. When N Nsub, all interior and fringe candidates are returned. The user may supplement these with additional Nsub N random candidates if desired. If Nsub < N, the execution flow looks for a variable best, providing the index of xmin n in Xn, making sure about 10% of tricands come from adjacent triangles. When best=NULL, tricands are sub-sampled at random. 3 Classical GP benchmarking In this first of two sections on benchmarking, we focus on BO via traditional GP surrogates. We follow a homework problem in Section 7.4 of Gramacy [2020], which piggy-backs off of GP, EI, and TS demonstrations earlier in the chapter. Software and other particulars are relegated to Appendix A.1. All of our examples are fully reproducible using the code provided in our git repository. We consider three methods for solving EI acquisitions: a continuous search of the criterion via L-BFGS-B with 5-multi-starts, LHS candidates, and tricands. For TS acquisitions, we similarly employ both LHS candidates and tricands. As non-BO (not surrogate-based) comparators, serving primarily as benchmarks, we entertain raw L-BFGS-B and Nelder Mead [Nelder and Mead, 1965]. Two examples are showcased here, with a third example relegated to Appendix A.2. Our synthetic fs, including those in Section 4, are described in more detail on the pages of the Virtual Library for Simulation Experiments [VLSE; Surjanovic and Bingham, 2013]. In all of our experiments we code inputs to [0, 1]d and summarize results for 100 random restarts where each surrogate is initialized with the same (unique to each random restart) starting design of size n0 = 12, except in Section 3.2 where we use n0 = 60. This design is taken uniformly at random following the advice of Zhang et al. [2021] who caution that small space-filling initial designs can spark pathological behavior in BO. We track best observed value (BOV) f min n as a measure of progress which is summarized by median over n = 1, . . . , nend and by boxplots for particular n along the way. 3.1 Goldstein Price The Goldstein Price function is a popular low-dimensional (2d) benchmark for BO [e.g., Picheny et al., 2012]. A total of nend = 50 acquisitions are entertained, and all candidate-based methods (tricands and LHS) are limited to fifty candidates. This means max = Nsub = 50 for tricands, with fewer candidates when N < Nsub. The top row of Figure 4 summarizes results in three views. In the top-left panel, median progress in BOV is shown over increasing budgets of evaluations (n) as if each subsequent acquisition were the last. In the top-middle and right panels, boxplots capture the distribution of BOV at n = 30 and n = 50, respectively. In the top-left panel, methods based on a multi-start numerical local search use solid lines; those based on tricands are dashed; those based on LHS candidates are dotted. (Nelder Mead is an exception, being red-dashed.) In the boxplots, tricands use slightly heavier ink so that they stand out. Text printed at the top of the top-right panel indicates the average number of times the criterion, EI or TS, was evaluated in solving the inner optimization sub-problem(s), cumulative over all acquisitions. In the case of a numerically optimized EI (labeled EI ), these happen within iterations of L-BFGS-B search. Beginning with medians over n in the top-left panel, observe that the dashed lines (those based on tricands) are uniformly superior to all other comparators. EI with tricands is slightly better than TS with tricands. Only at the very end of the run does the raw Nelder Mead alternative look competitive. At the 30th evaluation (top-middle) the boxplots indicate that there are some MC repetitions where BOV based on tricands are underperforming. However, more than half of the time these are the best two (i.e., both EI and TS with tricands) of all. TS with tricands is the winner in terms of best-case performance, whereas the EI version has slightly better worst-case behavior. Neither local optimizer (red) is competitive. Finally, in the top-right panel we see that the story is similar, except that now EI with tricands edges out its TS analog. Although median performance for Nelder Mead is competitive, more than half of the time its BOV at nend = 50 is among the two worst in the experiment. 0 10 20 30 40 50 3.0 2.5 2.0 1.5 1.0 0.5 0.0 Goldstein Price, no noise n: blackbox evaluations median best observed value Nelder Mead BFGS EI EI tri EI lhs TS tri TS lhs 3.0 2.5 2.0 1.5 1.0 30th evaluation best observed value final evaluation # of candidates/criteria evals 9021 1690 1900 1690 1900 0 50 100 150 200 250 300 2.5 2.0 1.5 1.0 0.5 Assemble to Order, input dependent noise n: blackbox evaluations median (estimated) best observed value EI EI tri EI hyb 2.6 2.4 2.2 2.0 150th evaluation (estimated) best observed value final evaluation Figure 4: Goldstein Price (top) and ATO (bottom) BOV over 100 MC trials: median (left panel) and distribution at intermediate (middle) and final (right) acquisitions. Suffixes -tri or -lhs denote a candidate search. Absence of a suffix (e.g., EI alone) indicates multi-start L-BFGS-B. Perhaps the most striking result from the figure is that tricands-based EI ( EI-tri ) outperforms its multi-start numerical analog ( EI ) despite a factor of five fewer evaluations of the EI criterion (about 9000 compared to 1700). A more aggressive, gradient-based search misutilizes computational resources. Finding a precise solution ultimately furnishing a maximal local value in an inferior domain of attraction may come at the expense of finding an accurate approximation near a global solution. The same is true, but to a lesser extent, when comparing TS variations (5% reduction). See Appendix A.2 for nearly identical results with the higher dimensional Hartmann 6 function. 3.2 Assemble to order (ATO) The assemble-to-order (ATO) simulator [Hong and Nelson, 2006] deploys a queuing system virtualizing a manufacturer receiving orders for a variety of widgets, each with different demand for component parts. Target inventory levels for eight parts comprise the controllable inputs (d = 8). The output is profit, a scalar distilling inventory and order fulfillment costs under random orders for parts, fulfillment and inventory replenishment times. ATO is coded in Matlab and exhibits input-dependent noise. Although it is reasonably fast (seconds per evaluation), relatively large n is required to separate signal from noise (n0 = 80 and nend = 300). Cubic computational bottlnecks for GP inference are in play here, despite using a thrifty heteroskedastic GP [Binois et al., 2018]. More details, including software, are provided in Appendix A.1. Since our Matlab licenses prevented us from running parallel instances, we had to limit our experiment somewhat to obtain results in a reasonable amount of time. Therefore we only entertained three acquisition alternatives: random multi-start numerically optimized EI, tricands with the default Nsub = 100d = 800, and a hybrid using the best tricands candidate to initialize a local numerically optimized EI ( EI-hyb ). The bottom panels of Figure 4 track BOV for these three alternatives in a view similar to the panels above. Note that the original ATO problem involves maximization, so we have negated the output and also scaled it so that its units were more similar to our other benchmark problems. Since the output is random, the BOVs are estimates. We used the surrogate fit at nend to assign in-sample predictions via µnend(xi). Otherwise, the story is similar to our earlier synthetic results. Tricands progress is initially slower than numerically optimized EI, but eventually gives lower BOV values. The difference at nend is slight, but statistically significant. A paired Wilcox test with alternative hypothesis that BOV for EI-tri is below EI has a p-value of 0.0037. The hybrid is even better, beating pure tricands and pure numerical optimization 85 and 88 out of 100 times, respectively. 4 Sampling-based surrogates Tricands are most valuable when the inner-optimization problem cannot be solved by library-based numerical methods, even locally. This happens when the surrogate predictive surface is discontinuous and/or when inference requires MCMC. Our examples here involve response surfaces exhibiting nonstationarity, meaning that the input output dynamics evolve over the input space. This demands a more elaborate surrogate. We consider two. A treed Gaussian process [TGP; Gramacy and Lee, 2008] uses axis-aligned partitioning with GPs. Such divide-and-conquer excels when dynamics change abruptly across individual inputs creating distinct regimes. MCMC can average over the location of probable partition boundaries. We use the R package tgp on CRAN [Gramacy, 2007], following Section 4 of Gramacy and Taddy [2010] for candidate-based EI acquisition. Deep Gaussian processes [DGPs; Damianou and Lawrence, 2013] warp inputs to accommodate a more subtle evolution of dynamics in the input space compared to the abrupt regime changes of TGP. Although variational inference is popular for DGPs [Salimbeni and Deisenroth, 2017], Sauer et al. [2020] argue that in active learning contexts, such as BO, full posterior integration leads to better uncertainty quantification, and thus better acquisitions. Here we use the R package deepgp on CRAN [Sauer, 2021] which supports EI evaluation on candidates. Unfortunately, neither tgp nor deepgp facilitate TS acquisition for BO. However, the tgp package furnishes a maximum a posteriori sample which can be used as the basis of a local search [Gramacy and Taddy, 2010, Section 4]. Here we show that tricands provides better candidates for both pure candidate EI search and this hybrid scheme. 4.1 Abrupt changes The Gramacy & Lee (G&L) function benefits from a model supporting hard breaks even though its dynamics evolve smoothly. The 2d input domain (coded to [0, 1]2) is mostly flat except in one quadrant where a local maximum is twinned with a (global) minimum. The domain of attraction of that global minimum covers only about 10% of the input space. It is easily missed by random initial designs and candidate sets. TGP is able to isolate the interesting quadrant with just two axis aligned partitions, thus recognizing that sampling effort should be concentrated there. The top row of Figure 5 shows results in the same three views as earlier. The number of candidates is limited to twenty, otherwise the setup is unchanged. First ignore the red lines and boxplots (those labeled hyb- ) and focus on the black (TGP) and green (DGP). Notice that in both cases the tricands-based comparators, dashed and/or bolded, outperform their LHS-based analogues. In terms of medians over n (top-left panel), that dominance is uniform. In terms of boxplots, the disparity is stark with the exception of a few outlying DGP-based BOV values with tricands. TGP seems to edge out DGP (with tricands) which we attribute to the abrupt change between the interesting quadrant and the rest of the input space. Now focus on the red hybresults. These are the ones where TGP candidate-based search is finished with a gradient-based EI search on the maximum a posteriori model. Again, tricands wins. To supplement the visuals, we report that the tricands hybrid BOV value was lowest in 97/100 reps (top-right). A higher dimensional example using the Michaelwicz function is provided in Appendix A.3. 0 10 20 30 40 50 0.4 0.3 0.2 0.1 0.0 Gramacy & Lee, no noise n: blackbox evaluations median best observed value TGP lhs TGP tri hyb lhs hyb tri DGP lhs DGP tri 0.42 0.40 0.38 0.36 0.34 0.32 0.30 30th evaluation best observed value final evaluation 0.05 0.10 0.15 0.20 Levy 5d, no noise n: blackbox evaluations median best observed value TGP lhs TGP tri hyb lhs hyb tri DGP lhs DGP tri 0.00 0.05 0.10 0.15 50th evaluation best observed value final evaluation Figure 5: Gramacy & Lee (top) and 5d Levy (bottom) with MCMC-based comparators. 4.2 Smooth changes The Levy function is defined in arbitrary dimension. It is ruffled with many peaks and valleys. Here we consider d = 5 for variety; see the bottom row of Figure 5. Commensurate with our other examples, we consider two hundred candidates and nend = 75. Both candidate DGP variations outperform their TGP counterparts. Slowly/smoothly varying dynamics favor warping inputs as opposed to hard partitioning. This is evident in both median and full distribution (boxplot) views. In the median view (bottom-left), notice that the dashed lines are uniformly under the solid ones of the same color. Tricands are providing better median performance over all n. Although the preeminence of tricands is apparent visually, we report that it provided better BOV in 81/100 MC restarts among the best (DGP) comparators. When tricands are involved, the hybrid candidate/numerical option is not discernibly better than the pure candidate alternative. 5 Discussion We offer a novel take on space-filling candidates for acquisition in BO. The idea is to fill the spaces in-between previous acquisitions. This is motivated by an analogy to bisection search, but also by the nature of GP predictive surfaces which often serve as surrogates in BO. GP surrogates have organically inflated uncertainty between training data sites, which makes those spots attractive for BO acquisition. This notion is extended to the space between the convex hull of existing training data and the boundary B of the study region. In an array of benchmark exercises we have demonstrated that these tricands lead to superior performance in BO compared to both higher-powered gradient-based acquisition schemes and simpler space-filling candidates. Tricands main attraction is that they mimic the behavior of higher-powered searches with the implementation simplicity of candidates. That simplicity means that tricands may be deployed where the higher-powered alternatives cannot, such as when the surrogate is not continuous or requires MCMC. Additional discussion on high input dimension and other extensions is provided in Appendix B. Acknowledgments and Disclosure of Funding We are grateful for funding from DOE LAB 17-1697 via subaward from Argonne National Laboratory for Sci DAC/DOE Office of Science ASCR and High Energy Physics, and for comments from two teams of anonymous referees which greatly improved the paper. Dario Azzimonti, David Ginsbourger, Clément Chevalier, Julien Bect, and Yann Richet. Adaptive design of experiments for conservative estimation of excursion sets. Technometrics, pages 1 14, 2020. Evan Baker, Pierre Barbillon, Arindam Fadikar, Robert B Gramacy, Radu Herbei, David Higdon, Jiangeng Huang, Leah R Johnson, Pulong Ma, Anirban Mondal, et al. Analyzing stochastic computer models: A review with opportunities. ar Xiv preprint ar Xiv:2002.01321, 2020. C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw., 22(4):469 483, December 1996. ISSN 0098-3500. doi: 10.1145/235815.235821. URL https://doi.org/10.1145/235815.235821. R. Bates and L. Pronzato. Emulator-based global optimisation using lattices and delaunay tesselation. In Sensitivity Analysis of Model Output, pages 189 192, Madrid (Espagne), 2001. Julien Bect, David Ginsbourger, Ling Li, Victor Picheny, and Emmanuel Vazquez. Sequential design of computer experiments for the estimation of a probability of failure. Statistics and Computing, 22(3):773 793, 2012. Henrik Bengtsson. R.matlab: Read and Write MAT Files and Call MATLAB from Within R, 2018. URL https://CRAN.R-project.org/package=R.matlab. R package version 3.6.2. James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24, 2011. M Binois, RB Gramacy, and M Ludkovski. Practical heteroscedastic Gaussian process modeling for large simulation experiments. Journal of Computational and Graphical Statistics, 27(4):808 821, 2018. doi: 10.1080/10618600.2018.1458625. URL https://doi.org/10.1080/10618600.2018.1458625. M Binois, J Huang, RB Gramacy, and M Ludkovski. Replication or exploration? Sequential design for stochastic simulation experiments. Technometrics, 27(4):808 821, 2019. doi: 10.1080/00401706.2018.1469433. URL https://doi.org/10.1080/00401706.2018.1469433. Mickael Binois and Nathan Wycoff. A survey on high-dimensional Gaussian process modeling with application to bayesian optimization, 2021. Mickaël Binois and Robert B. Gramacy. hetgp: Heteroskedastic Gaussian process modeling and sequential design in r. Journal of Statistical Software, 98(13):1 44, 2021. doi: 10.18637/jss.v098.i13. URL https: //www.jstatsoft.org/index.php/jss/article/view/v098i13. Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001. AD Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879 2904, 2011. R.L. Burden and J.D. Faires. Numerical Analysis. PWS Publishers, 1985. Richard Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16, 02 2003. doi: 10.1137/0916069. Rob Carnell. lhs: Latin Hypercube Samples, 2018. URL https://CRAN.R-project.org/package=lhs. R package version 0.16. Clément Chevalier and David Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, pages 59 69. Springer, 2013. Clément Chevalier, Julien Bect, David Ginsbourger, Emmanuel Vazquez, Victor Picheny, and Yann Richet. Fast parallel kriging-based stepwise uncertainty reduction with application to the identification of an excursion set. Technometrics, 56(4):455 465, 2014. D Austin Cole, Robert B Gramacy, James E Warner, Geoffrey F Bomarito, Patrick E Leser, and William P Leser. Entropy-based adaptive design for contour finding and estimating reliability. ar Xiv preprint ar Xiv:2105.11357, 2021. Andreas Damianou and Neil D Lawrence. Deep Gaussian processes. In Artificial intelligence and statistics, pages 207 215. PMLR, 2013. Samuel Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Multi-objective Bayesian optimiza- tion over high-dimensional search spaces, 2021. David Eriksson, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems, pages 5497 5508, 2019. Matthias Feurer, Benjamin Letham, Frank Hutter, and Eytan Bakshy. Practical transfer learning for bayesian optimization. ar Xiv preprint ar Xiv:1802.02219, 2018. Peter I Frazier, Warren B Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410 2439, 2008. Roman Garnett. Bayesian Optimization. Cambridge University Press, 2022. in preparation. David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. A multi-points criterion for deterministic parallel global optimization based on kriging. In NCP07, 2007. J Gonzalez, M Osborne, and N Lawrence. GLASSES: Relieving the myopia of Bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 790 799, 2016. Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Artificial intelligence and statistics, pages 648 657. PMLR, 2016. RB Gramacy. tgp: an R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models. Journal of Statistical Software, 19(9):6, 2007. RB Gramacy. la GP: large-scale spatial modeling via local approximate Gaussian processes in R. Journal of Statistical Software, 72(1):1 46, 2016. RB Gramacy and Herbert KH Lee. Optimization under unknown constraints. In Bayesian Statistics, volume 9. Oxford University Press, 2011. RB Gramacy and HKH Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119 1130, 2008. RB Gramacy and F Sun. la GP: Local Approximate Gaussian Process Regression, 2018. URL http://bobby. gramacy.com/r_packages/la GP. R package version 1.5-3. RB Gramacy and MA Taddy. Categorical inputs, sensitivity analysis, optimization and importance tempering with tgp version 2, an R package for treed Gaussian process models. Journal of Statistical Software, 33(6): 1 48, 2010. Robert B. Gramacy. Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. Chapman Hall/CRC, Boca Raton, Florida, 2020. http://bobby.gramacy.com/surrogates/. Kai Habel, Raoul Grasman, Robert B. Gramacy, Pavlo Mozharovskyi, and David C. Sterratt. geometry: Mesh Generation and Surface Tessellation, 2019. URL https://CRAN.R-project.org/package=geometry. R package version 0.4.5. L.J. Hong and B.L. Nelson. Discrete optimization via simulation using compass. Operations Research, 54(1): 115 129, 2006. Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455 492, 1998. Remi Lam, Karen Willcox, and David H Wolpert. Bayesian optimization with a finite budget: An approximate dynamic programming approach. Advances in Neural Information Processing Systems, 29:883 891, 2016. ER Leatherman, TJ Santner, and AM Dean. Computer experiment designs for accurate prediction. Statistics and Computing, pages 1 13, 2017. Der-Tsai Lee and Bruce J Schachter. Two algorithms for constructing a delaunay triangulation. International Journal of Computer & Information Sciences, 9(3):219 242, 1980. Benjamin Letham and Eytan Bakshy. Bayesian optimization for policy search via online-offline experimentation. J. Mach. Learn. Res., 20:145 1, 2019. Alexandre Marques, Remi Lam, and Karen Willcox. Contour location via entropy reduction leveraging multiple information sources. In Advances in Neural Information Processing Systems, pages 5217 5227, 2018. D Mckay, Richard Beckman, and William Conover. A comparison of three methods for selecting vales of input variables in the analysis of output from a computer code. Technometrics, 21:239 245, 05 1979. Jonas Moˇckus. On bayesian methods for seeking the extremum. In Optimization techniques IFIP technical conference, pages 400 404. Springer, 1975. JA Nelder and R Mead. A simplex method for function minimization. The Computer Journal, 7(4):308 313, Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. Neur IPS Autodiff Workshop, 2017. V Picheny, T Wagner, and D Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, pages 1 20, 2012. Tony Pourmohamad and Herbert KH Lee. Bayesian Optimization with Application to Computer Experiments. Springer, New York, NY, 2021. Pritam Ranjan, Derek Bingham, and George Michailidis. Sequential experiment design for contour estimation from complex computer codes. Technometrics, 50(4):527 541, 2008. Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. ar Xiv preprint ar Xiv:1705.08933, 2017. Annie Sauer. deepgp: Sequential Design for Deep Gaussian Processes using MCMC, 2021. URL https: //CRAN.R-project.org/package=deepgp. R package version 0.3.0. Annie Sauer, Robert B Gramacy, and David Higdon. Active learning for deep Gaussian process surrogates. ar Xiv preprint ar Xiv:2012.08015, 2020. Warren Scott, Peter Frazier, and Warren Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996 1026, 2011. Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009. Peter Su and Robert L Scot Drysdale. A comparison of sequential delaunay triangulation algorithms. Computa- tional Geometry, 7(5-6):361 385, 1997. S Surjanovic and D Bingham. Virtual library of simulation experiments: test functions and datasets. http: //www.sfu.ca/~ssurjano, 2013. MA Taddy, HKH Lee, GA Gray, and JD Griffin. Bayesian guided pattern search for robust local optimization. Technometrics, 51(4):389 401, 2009. William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. Ryan Turner, David Eriksson, Michael Mc Court, Juha Kiili, Eero Laaksonen, Zhen Xu, and Isabelle Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. ar Xiv preprint ar Xiv:2104.10201, 2021. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/s41592-019-0686-2. Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization using monte carlo tree search. Advances in Neural Information Processing Systems, 33:19511 19522, 2020. Boya Zhang, Robert B Gramacy, Leah Johnson, Kenneth A Rose, and Eric Smith. Batch-sequential design and heteroskedastic surrogate modeling for delta smelt conservation. ar Xiv preprint ar Xiv:2010.06515, 2020a. Boya Zhang, D Austin Cole, and Robert B Gramacy. Distance-distributed design for Gaussian process surrogates. Technometrics, 63(1):40 52, 2021. Yichi Zhang, Daniel W Apley, and Wei Chen. Bayesian optimization for materials design with mixed quantitative and qualitative variables. Scientific reports, 10(1):1 13, 2020b.