# Maximizing acquisition functions for Bayesian optimization

James T. Wilson (Imperial College London) · Frank Hutter (University of Freiburg) · Marc Peter Deisenroth (Imperial College London)

Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process. Fully maximizing acquisition functions produces the Bayes decision rule, but this ideal is difficult to achieve since these functions are frequently non-trivial to optimize. This is especially true when evaluating queries in parallel, where acquisition functions are routinely non-convex, high-dimensional, and intractable. We first show that acquisition functions estimated via Monte Carlo integration are consistently amenable to gradient-based optimization. Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose properties not only facilitate but justify the use of greedy approaches for their maximization.

## 1 Introduction

Bayesian optimization (BO) is a powerful framework for tackling complicated global optimization problems [32, 40, 44]. Given a black-box function $f : \mathcal{X} \to \mathcal{Y}$, BO seeks to identify a maximizer $x^{\star} \in \arg\max_{x \in \mathcal{X}} f(x)$ while simultaneously minimizing incurred costs. Recently, these strategies have demonstrated state-of-the-art results on many important, real-world problems, ranging from materials science [17, 57] to robotics [3, 7] to algorithm tuning and configuration [16, 29, 53, 56].

From a high-level perspective, BO can be understood as the application of Bayesian decision theory to optimization problems [11, 14, 45]. One first specifies a belief over possible explanations for f using a probabilistic surrogate model, and then combines this belief with an acquisition function L to convey the expected utility of evaluating a set of queries X. In theory, X is chosen according to the Bayes decision rule as L's maximizer by solving an inner optimization problem [19, 42, 59]. In practice, challenges associated with maximizing L greatly impede our ability to live up to this standard; nevertheless, this inner optimization problem is often treated as a black box unto itself. Failing to address this challenge leads to a systematic departure from BO's premise and, consequently, to consistent deterioration in achieved performance.

To help reconcile theory and practice, we present two modern perspectives for addressing BO's inner optimization problem that exploit key aspects of acquisition functions and their estimators. First, we clarify how sample path derivatives can be used to optimize a wide range of acquisition functions estimated via Monte Carlo (MC) integration. Second, we identify a common family of submodular acquisition functions and show that its constituents can generally be expressed in a more computer-friendly form. These acquisition functions' properties enable greedy approaches to maximize them efficiently, with guaranteed near-optimal results. Finally, we demonstrate through comprehensive experiments that these theoretical contributions directly translate to reliable and, often, substantial performance gains.

Correspondence to j.wilson17@imperial.ac.uk. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
```
Algorithm 1  BO outer loop (joint parallelism)
1: Given model M, acquisition function L, and data D
2: for t = 1, ..., T do
3:     Fit model M to the current data D
4:     Set q ← min(q_max, T − t)
5:     Find X ∈ argmax_{X′ ∈ 𝒳^q} L(X′)
6:     Evaluate y ← f(X)
7:     Update D ← D ∪ {(x_i, y_i)}_{i=1}^q
8: end for
```

Figure 1: (a) Pseudo-code for standard BO's outer loop with parallelism $q$; the inner optimization problem is boxed in red. (b, c) GP-based belief and expected utility (EI), given four initial observations; the aim of the inner optimization problem is to find the optimal query. (d) Time to compute $2^{14}$ evaluations of MC q-EI using a GP surrogate, for varied observation counts and degrees of parallelism. Runtimes fall off at the final step because $q$ decreases to accommodate the evaluation budget $T = 1{,}024$.

## 2 Background

Bayesian optimization relies on both a surrogate model M and an acquisition function L to define a strategy for efficiently maximizing a black-box function f. At each outer-loop iteration (Figure 1a), this strategy is used to choose a set of queries X whose evaluation advances the search process. This section reviews related concepts and closes with a discussion of the associated inner optimization problem. For an in-depth review of BO, we defer to the recent survey [52].

Without loss of generality, we assume BO strategies evaluate $q$ designs $X \in \mathbb{R}^{q \times d}$ in parallel, so that setting $q = 1$ recovers purely sequential decision-making. We denote the available information regarding f as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and, for notational convenience, assume noiseless observations $y = f(X)$. Additionally, we refer to L's parameters (such as an improvement threshold) as $\psi$ and to M's parameters as $\zeta$. Henceforth, direct reference to these terms will be omitted where possible.

**Surrogate models**  A surrogate model M provides a probabilistic interpretation of f whereby possible explanations for the function are seen as draws $f^k \sim p(f \mid \mathcal{D})$. In some cases, this belief is expressed as an explicit ensemble of sample functions [28, 54, 60]. More commonly, however, M dictates the parameters of a (joint) distribution over the function's behavior at a finite set of points X. By first tuning the model's (hyper)parameters $\zeta$ to explain $\mathcal{D}$, a belief is formed as $p(y \mid X, \mathcal{D}) = p(y; \theta)$ with $\theta \leftarrow M(X; \zeta)$. Throughout, $M(X; \zeta)$ is used to denote that belief p's parameters $\theta$ are specified by model M evaluated at X. A member of this latter category, the Gaussian process prior (GP) is the most widely used surrogate and induces a multivariate normal belief, $(\mu, \Sigma) \leftarrow M(X; \zeta)$ such that $p(y; \theta) = \mathcal{N}(y; \mu, \Sigma)$ for any finite set X (see Figure 1b).

**Acquisition functions**  With few exceptions, acquisition functions amount to integrals defined in terms of a belief p over the unknown outcomes $y = \{y_1, \ldots, y_q\}$ revealed when evaluating a black-box function f at corresponding input locations $X = \{x_1, \ldots, x_q\}$. This formulation occurs naturally as part of a Bayesian approach whereby the value of querying X is determined by accounting for the utility provided by possible outcomes $y^k \sim p(y \mid X, \mathcal{D})$.
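As a concrete illustration of these two ingredients (a minimal sketch of our own, not code from the paper: it assumes a zero-mean GP with a fixed RBF kernel and noiseless observations, and every helper name here is hypothetical), the snippet below forms the belief $(\mu, \Sigma) \leftarrow M(X)$ at a query set X and draws joint outcome samples $y^k = \mu + L z^k$. Later sketches in this section reuse these helpers and imports.

```python
import jax.numpy as jnp
from jax import random

def rbf(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel matrix between row-wise point sets A and B."""
    sqdist = jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return variance * jnp.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_belief(X_obs, y_obs, X, jitter=1e-6):
    """(mu, Sigma) <- M(X): the GP posterior over outcomes y at the q query points X."""
    K = rbf(X_obs, X_obs) + jitter * jnp.eye(X_obs.shape[0])
    K_s = rbf(X_obs, X)
    A = jnp.linalg.solve(K, K_s)                 # K^{-1} K_*
    mu = A.T @ y_obs
    Sigma = rbf(X, X) - K_s.T @ A
    return mu, Sigma

def sample_outcomes(key, mu, Sigma, m, jitter=1e-9):
    """Draw m joint samples y^k = mu + L z^k with z^k ~ N(0, I), where L L^T = Sigma."""
    L = jnp.linalg.cholesky(Sigma + jitter * jnp.eye(mu.shape[0]))
    Z = random.normal(key, (m, mu.shape[0]))
    return mu + Z @ L.T
```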
Denoting the chosen utility function as $\ell$, this paradigm leads to acquisition functions defined as expectations

$$\mathcal{L}(X; \mathcal{D}, \psi) = \mathbb{E}_{y}\left[\ell(y; \psi)\right] = \int \ell(y; \psi)\, p(y \mid X, \mathcal{D})\, dy. \qquad (1)$$

A seeming exception to this rule, non-myopic acquisition functions assign value by further considering how different realizations of $\mathcal{D}^k \triangleq \mathcal{D} \cup \{(x_i, y_i^k)\}_{i=1}^{q}$ impact our broader understanding of f, and usually correspond to more complex, nested integrals. Figure 1c portrays a prototypical acquisition surface, and Table 1 exemplifies popular myopic and non-myopic instances of (1).

| Abbr. | Acquisition function L | Reparameterization | MM |
|---|---|---|---|
| EI | $\mathbb{E}_{y}[\max(\mathrm{ReLU}(y - \alpha))]$ | $\mathbb{E}_{z}[\max(\mathrm{ReLU}(\mu + Lz - \alpha))]$ | Y |
| PI | $\mathbb{E}_{y}[\max(\mathbb{1}^{+}(y - \alpha))]$ | $\mathbb{E}_{z}[\max(\sigma(\tfrac{\mu + Lz - \alpha}{\tau}))]$ | Y |
| SR | $\mathbb{E}_{y}[\max(y)]$ | $\mathbb{E}_{z}[\max(\mu + Lz)]$ | Y |
| UCB | $\mathbb{E}_{y}[\max(\mu + \sqrt{\beta\pi/2}\,\lvert\gamma\rvert)]$ | $\mathbb{E}_{z}[\max(\mu + \sqrt{\beta\pi/2}\,\lvert Lz\rvert)]$ | Y |
| ES | $\mathbb{E}_{y_a}[H(\mathbb{E}_{y_b \mid y_a}[\mathbb{1}^{+}(y_b - \max(y_b))])]$ | $\mathbb{E}_{z_a}[H(\mathbb{E}_{z_b}[\mathrm{softmax}(\tfrac{\mu_{b\mid a} + L_{b\mid a} z_b}{\tau})])]$ | N |
| KG | $\mathbb{E}_{y_a}[\max(\mu_b + \Sigma_{b,a}\Sigma_{a,a}^{-1}(y_a - \mu_a))]$ | $\mathbb{E}_{z_a}[\max(\mu_b + \Sigma_{b,a}\Sigma_{a,a}^{-1} L_a z_a)]$ | N |

Table 1: Examples of reparameterizable acquisition functions; the final column indicates whether they belong to the MM family (Section 3.2). Glossary: $\mathbb{1}^{+}/\mathbb{1}^{-}$ denote the right-/left-continuous Heaviside step functions; ReLU and $\sigma$ the rectified linear and sigmoid nonlinearities, respectively; $H$ the Shannon entropy; $\alpha$ an improvement threshold; $\tau$ a temperature parameter; $L$ the Cholesky factor such that $LL^{\top} = \Sigma$; and residuals $\gamma \sim \mathcal{N}(0, \Sigma)$. Lastly, the non-myopic acquisition functions (ES and KG) are assumed to be defined over a discretization; terms associated with the query set and the discretization are denoted via subscripts $a$ and $b$, respectively.

**Inner optimization problem**  Maximizing acquisition functions plays a crucial role in BO as the process through which abstract machinery (e.g., model M and acquisition function L) yields concrete actions (e.g., decisions regarding sets of queries X). Despite its importance, however, this inner optimization problem is often neglected. This lack of emphasis is largely attributable to a greater focus on creating new and improved machinery, as well as on applying BO to new types of problems. Moreover, elementary examples of BO facilitate L's maximization. For example, optimizing a single query $x \in \mathbb{R}^d$ is usually straightforward when x is low-dimensional and L is myopic.

Outside these textbook examples, however, BO's inner optimization problem becomes qualitatively more difficult to solve. In virtually all cases, acquisition functions are non-convex (frequently owing to the non-convexity of plausible explanations for f). Accordingly, increases in input dimensionality $d$ can be prohibitive to efficient query optimization. In the generalized setting with parallelism $q \geq 1$, this issue is exacerbated by the additional scaling in $q$. While this combination of non-convexity and (acquisition) dimensionality is problematic, the routine intractability of both non-myopic and parallel acquisition functions poses a commensurate challenge. As is generally true of integrals, the majority of acquisition functions are intractable. Even Gaussian integrals, which are often preferred because they lead to analytic solutions for certain instances of (1), are only tractable in a handful of special cases [13, 18, 20]. To circumvent the lack of closed-form solutions, researchers have proposed a wealth of diverse methods. Approximation strategies [13, 15, 60], which replace a quantity of interest with a more readily computable one, work well in practice but may fail to converge to the true value.
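For example (a sketch of our own, reusing the hypothetical helpers and imports from the snippet above), the parallel EI integral from Table 1 generally lacks a closed form, yet admits a simple unbiased MC estimate by averaging the utility $\ell(y^k) = \max_i \mathrm{ReLU}(y_i^k - \alpha)$ over joint posterior samples:

```python
def q_ei_mc(key, X, X_obs, y_obs, alpha, m=10_000):
    """Unbiased MC estimate of parallel EI: E_y[max_i ReLU(y_i - alpha)]."""
    mu, Sigma = gp_belief(X_obs, y_obs, X)
    Y = sample_outcomes(key, mu, Sigma, m)                   # (m, q) joint samples
    utility = jnp.max(jnp.maximum(Y - alpha, 0.0), axis=-1)  # l(y^k) per sample
    return jnp.mean(utility)
```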
In contrast, bespoke solutions [10, 20, 22] provide (near-)analytic expressions but typically do not scale well with dimensionality.² Lastly, MC methods [27, 47, 53] are highly versatile and generally unbiased, but are often perceived as non-differentiable and, therefore, inefficient for the purpose of maximizing L. Regardless of the method, however, the (often drastic) increase in cost when evaluating L's proxy acts as a barrier to efficient query optimization, and these costs grow over time, as shown in Figure 1d. In an effort to address these problems, we now go inside the outer loop and focus on efficient methods for maximizing acquisition functions.

²By near-analytic, we refer to cases where an expression contains terms that cannot be computed exactly but for which high-quality solvers exist (e.g., low-dimensional multivariate normal CDF estimators [20, 21]).

## 3 Maximizing acquisition functions

This section presents the technical contributions of this paper, which break down into two complementary topics: 1) gradient-based optimization of acquisition functions that are estimated via Monte Carlo integration, and 2) greedy maximization of myopic maximal acquisition functions. Below, we discuss each contribution separately, along with its related literature.

### 3.1 Differentiating Monte Carlo acquisitions

Gradients are one of the most valuable sources of information for optimizing functions. In this section, we detail both the reasons and the conditions whereby MC acquisition functions are differentiable, and we further show that most well-known examples readily satisfy these criteria (see Table 1).

We assume that L is an expectation over a multivariate normal belief $p(y \mid X, \mathcal{D}) = \mathcal{N}(y; \mu, \Sigma)$ specified by a GP surrogate such that $(\mu, \Sigma) \leftarrow M(X)$. More generally, we assume that samples can be generated as $y^k \sim p(y \mid X, \mathcal{D})$ to form an unbiased MC estimator of an acquisition function,

$$\mathcal{L}(X) \approx \mathcal{L}_m(X) \triangleq \frac{1}{m}\sum_{k=1}^{m} \ell(y^k).$$

Given such an estimator, we are interested in verifying whether

$$\nabla\mathcal{L}(X) \approx \nabla\mathcal{L}_m(X) \triangleq \frac{1}{m}\sum_{k=1}^{m} \nabla\ell(y^k), \qquad (2)$$

where $\nabla\ell$ denotes the gradient of the utility function $\ell$ taken with respect to X. The validity of the MC gradient estimator (2) is obscured by the fact that $y^k$ depends on X through the generative distribution p, and that $\nabla\mathcal{L}_m$ is the expectation of $\ell$'s derivative rather than the derivative of $\ell$'s expectation. Originally referred to as infinitesimal perturbation analysis [8, 24], the reparameterization trick [37, 50] is the process of differentiating through an MC estimate to its generative distribution p's parameters, and consists of two components: i) reparameterizing samples from p as draws from a simpler base distribution $\hat{p}$, and ii) interchanging differentiation and integration by taking the expectation over sample path derivatives.

**Reparameterization**  Reparameterization is a way of interpreting samples that makes their differentiability w.r.t. a generative distribution's parameters transparent. Often, samples $y^k \sim p(y; \theta)$ can be re-expressed as a deterministic mapping $\phi : \mathcal{Z} \to \mathcal{Y}$ of simpler random variates $z^k \sim \hat{p}(z)$ [37, 50]. This change of variables helps clarify that, if $\ell$ is a differentiable function of $y = \phi(z; \theta)$, then $\frac{d\ell}{d\theta} = \frac{d\ell}{d\phi}\frac{d\phi}{d\theta}$ by the chain rule of (functional) derivatives. If the generative distribution p is multivariate normal with parameters $\theta = (\mu, \Sigma)$, the corresponding mapping is $\phi(z; \theta) \triangleq \mu + Lz$, where $z \sim \mathcal{N}(0, I)$ and $L$ is $\Sigma$'s Cholesky factor such that $LL^{\top} = \Sigma$.
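To make the sample-path gradient (2) concrete, the sketch below (our own, not the paper's code; it builds on the hypothetical `gp_belief` helper above) writes the q-EI estimator through the mapping $\phi(z; \theta) = \mu + Lz$, so that the MC estimate becomes a deterministic, differentiable function of the query set X and automatic differentiation returns $\nabla\mathcal{L}_m(X)$. Note that max and ReLU are differentiable almost everywhere, which suffices for this purpose.

```python
from jax import grad

def q_ei_reparam(X, key, X_obs, y_obs, alpha, m=1024):
    """Reparameterized MC q-EI: a deterministic, differentiable function of X."""
    mu, Sigma = gp_belief(X_obs, y_obs, X)         # theta = (mu, Sigma) <- M(X)
    L = jnp.linalg.cholesky(Sigma + 1e-9 * jnp.eye(mu.shape[0]))
    Z = random.normal(key, (m, mu.shape[0]))       # z^k ~ N(0, I), fixed by `key`
    Y = mu + Z @ L.T                               # y^k = phi(z^k; theta)
    return jnp.mean(jnp.max(jnp.maximum(Y - alpha, 0.0), axis=-1))

# Sample-path gradient of the estimator w.r.t. q = 2 candidate queries.
X_obs = jnp.array([[0.1], [0.5], [0.9]])
y_obs = jnp.array([0.2, 0.8, 0.1])
X = jnp.array([[0.3], [0.7]])
dL_dX = grad(q_ei_reparam)(X, random.PRNGKey(0), X_obs, y_obs, alpha=jnp.max(y_obs))
```

Because the base draws $z^k$ are fixed by the PRNG key, repeated evaluations are deterministic, so the estimator can be handed directly to multi-start gradient ascent or quasi-Newton routines.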
Rewriting (1) as a Gaussian integral and reparameterizing, we have

$$\mathcal{L}(X) = \int_{a}^{b} \ell(y)\, \mathcal{N}(y; \mu, \Sigma)\, dy = \int_{a'}^{b'} \ell(\mu + Lz)\, \mathcal{N}(z; 0, I)\, dz, \qquad (3)$$

where each of the $q$ terms $c'_i$ in both $a'$ and $b'$ is transformed as $c'_i = \big(c_i - \mu_i - \sum_{j<i} L_{i,j} z_j\big) / L_{i,i}$.
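As a quick sanity check of this interchange of differentiation and integration (our own illustration, not from the paper): for $q = 1$, EI is analytic with $\partial\mathrm{EI}/\partial\mu = \Phi((\mu - \alpha)/\sigma)$, and the sample-path gradient of the reparameterized estimate recovers it.

```python
import jax.numpy as jnp
from jax import grad, random
from jax.scipy.stats import norm

def ei_exact(mu, alpha=0.0, sigma=1.0):
    """Closed-form EI for a single query (q = 1)."""
    u = (mu - alpha) / sigma
    return (mu - alpha) * norm.cdf(u) + sigma * norm.pdf(u)

def ei_mc(mu, z, alpha=0.0, sigma=1.0):
    """Reparameterized MC estimate of the same integral, via y = mu + sigma * z."""
    return jnp.mean(jnp.maximum(mu + sigma * z - alpha, 0.0))

z = random.normal(random.PRNGKey(0), (100_000,))
print(grad(ei_exact)(0.3))   # Phi(0.3) ~= 0.6179
print(grad(ei_mc)(0.3, z))   # sample-path estimate, close to 0.6179
```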