# bayesian_optimization_with_unknown_search_space__5279aec0.pdf

Bayesian Optimization with Unknown Search Space

Huong Ha, Santu Rana, Sunil Gupta, Thanh Nguyen, Hung Tran-The, Svetha Venkatesh Applied Artiﬁcial Intelligence Institute (A2I2) Deakin University, Geelong, Australia {huong.ha, santu.rana, sunil.gupta, thanhnt, hung.tranthe, svetha.venkatesh}@deakin.edu.au

Applying Bayesian optimization in problems wherein the search space is unknown is challenging. To address this problem, we propose a systematic volume expansion strategy for the Bayesian optimization. We devise a strategy to guarantee that in iterative expansions of the search space, our method can ﬁnd a point whose function value within ϵ of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal expansion required iteratively. We derive analytic expressions for when to trigger the expansion and by how much to expand. We also provide theoretical analysis to show that our method achieves ϵ-accuracy after a ﬁnite number of iterations. We demonstrate our method on both benchmark test functions and machine learning hyper-parameter tuning tasks and demonstrate that our method outperforms baselines.

1 Introduction

Choosing where to search matters. A time-tested path in the quest for new products or processes is through experimental optimization. Bayesian optimization offers a sample efﬁcient strategy for experimental design by optimizing expensive black-box functions [9 11]. But one problem is that users need to specify a bounded region to restrict the search of the objective function extrema. When tackling a completely new problem, users do not have prior knowledge, hence there is no guarantee that an arbitrarily deﬁned search space contains the global optimum. Thus application of the Bayesian optimization framework when the search region is unknown remains an open challenge [16].

One approach is to use a regularized acquisition function such that its maximum can never be at inﬁnity - hence no search space needs to be declared and an unconstrained optimizer can be used [16]. Other approaches use volume expansion, i.e. starting from the user-deﬁned region, the search space is expanded during the optimization. The simplest strategy is to repeatedly double the volume of the search space every several iterations [16]. Nguyen et al suggest a volume expansion strategy based on the evaluation budget [12]. All these methods require users to specify critical parameters - as example, regularization parameters [16], or growth rate, expansion frequency (volume doubling) [16] or budget [12]. These parameters are difﬁcult to specify in practice. Additionally, [12] is computationally expensive and the user-deﬁned search space needs to be close to the global optimum.

In this paper, we propose a systematic volume expansion strategy for the Bayesian optimization framework wherein the search space is unknown. Without any prior knowledge about the objective function argmax or strict assumptions on the behavior of the objective function, it is impossible to guarantee the global convergence when the search space is continuously expanded. To circumvent this problem, we consider the setting where we achieve the global ϵ-accuracy condition, that is, we aim to ﬁnd a point whose function value is within ϵ of the objective function global maximum.

Our volume expansion strategy is based on two guiding principles: 1) The algorithm can reach a point whose function value is within ϵ of the objective function maximum in one expansion, and, 2) the search space should be minimally expanded so that the algorithm does not spend unnecessary

33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada.

evaluations near the search space boundary. As the objective function is unknown, it is not possible to compute this ideal expansion region. Using the GP-UCB acquisition function as a surrogate, this region is computed as one that contains at least one point whose acquisition function value is within ϵ of the acquisition function maximum. However, by using a surrogate to approximate the objective function, there is no guarantee that we can achieve the global ϵ-accuracy within one expansion. Hence multiple expansions are required, and a new expansion is triggered when the local ϵ-accuracy is satisﬁed, i.e. when the algorithm can ﬁnd a point whose function value is within ϵ of the objective function maximum in the current search space. Analytical expressions for the size of the new expansion space and when to trigger the expansion are derived. The guarantees for the ϵ-accuracy condition, however, now lapses in the expanded region, and so we adjust the acquisition function appropriately to maintain the guarantee. Finally, we provide theoretical analysis to show that our proposed method achieves the global ϵ-accuracy condition after a ﬁnite number of iterations.

We demonstrate our algorithm on ﬁve synthetic benchmark functions and three real hyperparameter tuning tasks for common machine learning models: linear regression with elastic net, multilayer perceptron and convolutional neural network. Our experimental results show that our method achieves better function values with fewer samples compared to state-of-the-art approaches. In summary, our contributions are:

Formalising the analysis for Bayesian optimization framework in an unknown search space setting, and introducing ϵ-accuracy as a way to track the algorithmic performance;

Providing analytic expressions for how far to expand the search space and when to expand the search space to achieve global ϵ-accuracy;

Deriving theoretical global ϵ-accuracy convergence; and,

Demonstrating our algorithm on both synthetic and real-world problems and comparing it against state-of-the-art methods.

Our method differs from previous works in that 1) our method does not require any algorithmic parameters, automatically adjusting both when to trigger the expansion and by how much to expand, and, 2) our approach is the only one to guarantee the global ϵ-accuracy condition. This is because we guarantee the local ϵ-accuracy condition in each search space, thus eventually the global ϵaccuracy is achieved. Without this local guarantee, the suggested solution cannot be guaranteed to reach global ϵ-accuracy. The regularization [16] and the ﬁltering method [12] require the global optimum to be within a bound constructed by either the user speciﬁed regularizer or the budget. The volume doubling method [16] can continue to expand the search space to inﬁnity, however, the local ϵ-accuracy condition is not guaranteed in each search space.

The paper is organized as follows. Section 2 gives an overview of Bayesian optimization and discusses some of the related work. Section 3 describes the problem setup. Section 4 proposes our new expansion strategy for the Bayesian optimization framework when the search space is unknown. A theoretical analysis for our proposed method is presented in Section 5. In Section 6, we demonstrate the effectiveness of our algorithm by numerical experiments. Finally, Section 7 concludes the paper.

2 Background and Related Work

2.1 Background

Bayesian optimization is a powerful optimization method to ﬁnd the global optimum of an unknown objective function f(x) by sequential queries [9 11, 17, 18]. First, at time t, a surrogate model is used to approximate the behaviour of f(x) using all the current observed data Dt 1 = {(xi, yi)}n i=1, yi = f(xi) + ξi, where ξi N(0, σ2) is the noise. Second, an acquisition function is constructed from the surrogate model that suggests the next point xitr to be evaluated. The objective function is then evaluated at xitr and the new data point (xitr, yitr) is added to Dt 1. These steps are conducted in an iterative manner to get the best estimate of the global optimum.

The most common choice for the surrogate model used in Bayesian optimization is the Gaussian Process (GP) [14]. Assume the function f follows a GP with mean function m0(x) and covariance function k(x, x ), the posterior distribution of f given the observed data Dt 1 = {(xi, yi)}n i=1 is a

GP with the following posterior mean and variance,

µt 1(x) = m0(x) + k|Dt 1|(x)T (K|Dt 1| + σ2I|Dt 1|) 1y|Dt 1|,

σ2 t 1(x) = k(x, x) k|Dt 1|(x)T (K|Dt 1| + σ2I|Dt 1|) 1k|Dt 1|(x), (1)

where y|Dt 1| = [y1, . . . , y|Dt 1|]T , k|Dt 1|(x) = [k(x, xi)]|Dt 1| i=1 , K|Dt 1| = [k(xi, xj)]i,j, I|Dt 1| is the |Dt 1| |Dt 1| identity matrix and |Dt 1| denotes the cardinality of Dt 1. To aid readability, in the sequel we remove the notation that shows the dependence of k, K, I, y on |Dt 1|.

There are many existing acquisition functions [6, 7, 10, 11, 20] and in this paper, we focus only on the GP-UCB acquisition function [1, 2, 5, 19]. The GP-UCB acquisition function is deﬁned as,

αUCB(x; Dt 1) = µt 1(x) + p

βtσt 1(x), (2)

where µt 1(x), σt 1(x) are the posterior mean and standard deviation of the GP given observed data Dt 1 and βt 0 is an appropriate parameter that balances the exploration and exploitation. Given a search domain, {βt} can be chosen as in [19] to ensure global convergence in this domain.

2.2 Related Work

All the work related to the problem of Bayesian optimization with unknown search space have been described in Section 1. There is the work in [3] introduces the term ϵ-accuracy. However, their purpose is to unify the Bayesian optimization and the Level-set estimation framework.

3 Problem Setup

We wish to ﬁnd the global argmax xmax of an unknown objective function f : Rd 7 R, whose argmax is at a ﬁnite location, i.e. xmax = argmaxx S f(x), (3)

where S is a ﬁnite region that contains the argmax of the function f(x). In practice, the region S is not known in advance, so users need to identify a search domain Suser which is likely to contain the argmax of f(x). This search domain can be set arbitrarily or based on limited prior knowledge. Thus there is no guarantee that Suser contains the global optimum of the objective function. In the trivial cases when the search space S is known or when S Suser, the global convergence can be guaranteed through classical analysis [4, 19]. Here, we consider the general case when S may or may not be a subset of Suser. Without any prior knowledge about S or strict assumptions on the behavior of the objective function, it is impossible to guarantee the global convergence. Therefore, in this work, instead of solving Eq. (3), we consider the setting where we achieve the global ϵ-accuracy condition. That is, for a small positive value ϵ, we ﬁnd a solution xϵ which satisﬁes,

f(xmax) f(xϵ) ϵ. (4)

4 Proposed Approach

We make some mild assumptions to develop our main results.

Assumption 4.1 The prior mean function m0(x) = 0.

This is done by subtracting the mean from all observations and is common practice.

Assumption 4.2 The kernel k(x, x ) satisﬁes, (1) when x x 2 + , k(x, x ) 0; (2) k(x, x ) 1 (x, x ) ; (3) k(x, x) = θ2, where θ 0 is the scale factor of the kernel function.

Various kernels satisfy Assumption 4.2, e.g. the Matérn kernel, the Square Exponential kernel. As the function can always be re-scaled, condition 2 is met without loss of generality [15, 19]. Deﬁning gk(γ): With these types of kernels, for all small positive γ, there always exists gk(γ) > 0,

x, x : x x 2 gk(γ), k(x, x ) γ. (5)

The value of gk(γ) can be computed from γ and the kernel covariance function k(x, x ) i.e. for Squared Exponential kernel k SE(x, x ) = θ2exp( x x 2 2/(2l2)), gk(γ) will be p

2l2log(θ2/γ).

Assumption 4.3 The kernel k(x, x ) is known in advance or can be learned from the observations.

Figure 1: Expanded region (blue), when the GP-UCB acquisition function argmax is at (1) inﬁnity ; or (2) at a ﬁnite location and its function value is larger or equal βtθ + ϵ/2; or (3) at a ﬁnite location and its function value is smaller than βtθ + ϵ/2.

4.1 Proposed Expansion Strategy

The ideal expansion strategy should satisfy two characteristics: 1) The algorithm can reach the global ϵ-accuracy condition in one expansion, and, 2) the search space should be minimally expanded so that the algorithm does not spend unnecessary evaluations near the search space boundary. Since we have a black-box objective function, it is not possible to compute the ideal expansion space Sideal directly. Let the exploration-exploitation parameters {βt} be chosen to ensure the objective function is upper bounded by the GP-UCB acquisition function with high probability. Then we can estimate Sideal by a region S as a minimal region that contains at least one point whose acquisition function value is within ϵ from the acquisition function maximum, i.e. xu S : |αUCB(xu; Dt 1) maxx Rd αUCB(x; Dt 1)| ϵ. Due to the approximation, there is no guarantee we can achieve the global ϵ-accuracy in one expansion. Thus we need multiple expansions sequential. A new expansion is triggered when the local ϵ-accuracy is satisﬁed in the previous expansion. In the following, we ﬁrst derive the value of the GP-UCB acquisition function when x (Proposition 4.1), and then use this value to derive analytical expressions for the size of the expansion space S (Theorem 4.1) and when to trigger a new expansion.

Proposition 4.1 When x , the GP-UCB acquisition function αUCB(x; Dt 1) βtθ, where βt is the exploration-exploitation parameter of the GP-UCB acquisition function and θ is the scale factor of the kernel function k(x, x ).

Derivation of the expansion search space Our idea is to choose the region S such that S = Rd \ A, where 1) A contains all the points x that are far from all the current observations, and, 2) A := {x Rd : |αUCB(x; Dt 1) βtθ| < ϵ/2}. Here, we will show that with this choice of S, there exists at least one point in S whose acquisition function value is within ϵ from the acquisition function maximum, given ϵ < | βtθ minx Rd(αUCB(x; Dt 1))|. We consider three cases that can happen to the GP-UCB acquisition function (See Figure 1):

Case 1: The argmax of the GP-UCB acquisition function is at inﬁnity. This means that the GP-UCB acquisition function maximum is equal to βtθ. As the GP-UCB acquisition function is continuous and ϵ < | βtθ minx Rd(αUCB(x; Dt 1))|, hence, there exists a point xu such that αUCB(xu) = βtθ ϵ/2. By the deﬁnition of S, it is straightforward that xu belongs to S, thus proving that there exists a point in S whose GP-UCB acquisition function value is within ϵ from the maximum of the acquisition function. Case 2: The argmax of the GP-UCB acquisition function x max is at a ﬁnite location and its acquisition function value is larger or equal βtθ + ϵ/2. It is straightforward to see that the argmax x max belongs to the region S and this is the point that satisﬁes |αUCB(x max; Dt 1) maxx Rd αUCB(x; Dt 1)| ϵ. Case 3: The GP-UCB acquisition function argmax is at a ﬁnite location and the acquisition function maximum is smaller than βtθ + ϵ/2. As the GP-UCB acquisition function is continuous and ϵ < | βtθ minx Rd(αUCB(x; Dt 1))|, there exists a point xu S : αUCB(xu; Dt 1) = βtθ ϵ/2. As maxx Rd αUCB(x; Dt 1) < βtθ + ϵ/2, it follows directly that |αUCB(xu; Dt 1) maxx Rd αUCB(x; Dt 1)| ϵ.

Theorem 4.1 now formally derives an analytical expression for one way to deﬁne region S.

Algorithm 1 Bayesian optimization with unknown search space (GPUCB-UBO)

1: Input: Gaussian Process (GP) M, acquisition functions αUCB, αLCB, initial observations Dinit, initial search space Suser, function f, positive small threshold ϵ, evaluation budget T. 2: Output: Point xϵ : max f(x) f(xϵ) ϵ. 3: Initialize D0 = Dinit, S = Suser, β1, tk = 0. Update the GP using D0. 4: for t = 1, 2, . . . , T do 5: Set tlocal = t tk 6: Compute xm = argmaxx S αUCB(x; Dt 1) 7: Set xt = xm, yt = f(xt). Update Dt = Dt 1 (xt, yt). 8: / Compute the expansion trigger, the regret upper bound / 9: Compute rb = αUCB(xt; Dt 1) maxx Dt αLCB(x; Dt 1) + 1/t2 local 10: / If expansion triggered, expand the search space / 11: if (rb <= ϵ) | (t == 1) then 12: Compute the new search space S as deﬁned in Theorem 4.1 13: Set tk = tk + tlocal 14: end if 15: / Adjust the βt based on the search space / 16: Compute βt following Theorem 5.1 17: Update the GP using Dt. 18: end for

Theorem 4.1 Consider the GP-UCB acquisition function αUCB(x; Dt 1). Let us deﬁne the region S = S|Dt 1| i=1 Si, Si = {x : x xi 2 dϵ}, xi Dt 1, |Dt 1| is the cardinality of Dt 1, dϵ = gk(min( p

( βtθϵ/2 ϵ2/16)/(|Dt 1|λmax)/ βt, 0.25ϵ/max(P

zj 0 zj))) with gk(.) as in Eq. (5), λmax be the largest singular value of (K + σ2I) 1, and zj be the jth

element of (K + σ2I) 1y. Given ϵ < | βtθ minx Rd(αUCB(x; Dt 1))|, then there exists at least one point in S whose acquisition function value is within ϵ from the acquisition function maximum, i.e. xu S : |αUCB(xu; Dt 1) maxx Rd αUCB(x; Dt 1)| ϵ.

Acquisition function adaption Let us denote Sk as the kth expansion search space (k 1). In each Sk, the parameter {βt} of the GP-UCB acquisition function needs to be valid to ensure the algorithm achieves the local ϵ-accuracy condition. Hence, a new {βt} is adjusted after each expansion. Details on how to compute the new {βt} are in Theorem 5.1.

Triggering the next expansion To guarantee the global ϵ-accuracy condition, in each search space Sk, we aim to ﬁnd an iteration Tk which satisﬁes r Sk(Tk) = (maxx Sk f(x) maxxi DTk f(xi)) ϵ before the next expansion. As we do not have maxx Sk f(x) and {f(xi)}, we bound r Sk(t) by rb,Sk(t) = maxx Sk αUCB(x; Dt 1)+1/t2 maxx Dt αLCB(x; Dt 1), where αLCB(x; Dt 1) = µt 1(x) βtσt 1(x). The next expansion is triggered when rb,Sk(t) reaches ϵ.

Search space optimization The theoretical search space developed in Theorem 4.1 is the union of |Dt 1| balls. To suit optimizer input, this region is converted to an encompassing hypercube using,

minxi Dt 1(xk i ) dϵ xk maxxi Dt 1(xk i ) + dϵ, k = 1, d. (6) Further reﬁnement of the implementation is provided in the supplementary material.

Algorithm 1 describes the proposed Bayesian optimization with unknown search space algorithm.

5 Theoretical Analysis

First, to ensure the validity of our algorithm, we prove that for a wide range of kernels, for any search space Sk and any positive ϵ, with a proper choice of {βt}, our trigger for expansion condition occurs with high probability. When this happens, the algorithm achieves the local ϵ-accuracy condition.

Proposition 5.1 For any d-dimensional domain Sk with side length rk, for the kernel classes: ﬁnite dimensional linear, Squared Exponential and Matérn, suppose the kernel k(x, x ) satisﬁes the following condition on the derivatives of GP sample paths f: ak, bk > 0, Pr{supx Sk | f/ xj| >

L} ak exp (L/bk)2, j = 1, d. Pick δ (0, 1), and deﬁne βt = 2 log(t22π2/(3δ)) + 2d log(t2dbkrk p

log(4dak/δ)), then ϵ > 0, with probability larger than 1 δ, there Tk : t Tk, maxx Sk αUCB(x; Dt 1) maxx Dt αLCB(x; Dt 1) ϵ 1/t2; and t that satisﬁes the previous condition, maxx Sk f(x) maxx Dt f(x) ϵ.

Second, we prove that with a proper choice of {βt} and for a wide range class of kernels, after a ﬁnite number of iterations, our algorithm achieves the global ϵ-accuracy condition with high probability.

Theorem 5.1 Denote {Sk} as the series of the expansion search space suggested by our algorithm (k 1). In each Sk, let Tk be the smallest number of iterations that satisﬁes our expansion triggered condition, i.e. rb,Sk(Tk) ϵ. Suppose the kernel k(x, x ) belong to the kernel classes listed in Proposition 5.1 and it satisﬁes the following condition on the derivatives of GP sample paths f: ak, bk > 0, Pr{supx Sk | f/ xj| > L} ak exp (L/bk)2, j = 1, d. Pick δ (0, 1), and deﬁne, βt = 2 log((t P j k 1 Tj)22π2/(3δ)) + 2d log((t P j k 1 Tj)2dbkrk p

log(4dak/δ)), P

j k 1 Tj + 1 t P

j k Tj, k = 1, 2, .... Then running the proposed algorithm with the above choice of βt for a sample f of a GP with mean function zero and covariance function k(x, x ), after a ﬁnite number of iterations, we achieve global ϵ-accuracy with at least 1 δ probability, i.e.

Pr{f(xmax) f(xsuggest) ϵ} 1 δ,

where xsuggest is the algorithm recommendation and xmax is the objective function global argmax.

Discussion The difference between our method and previous works is that we guarantee the local ϵ-accuracy condition in each search space, eventually achieving the global ϵ-accuracy. Previous methods do not give this guarantee, and thus their ﬁnal solution may not reach global ϵ-accuracy.

6 Experimental Evaluation

We evaluate our method on ﬁve synthetic benchmark functions and three hyperparameter tuning tasks for common machine learning models. For problems with dimension d, the optimization evaluation budget is 10d (excluding initial 3d points following a latin hypercube sampling [8]). The experiments were repeated 30 and 20 times for the synthetic functions and machine learning hyperparameter tuning tasks respectively. For all algorithms, the Squared Exponential kernel is used, the GP models are ﬁtted using the Maximum Likelihood method and the output observations {yi} are normalized yi N(0, 1). As with previous GP-based algorithms that use conﬁdence bounds [3, 19], our theoretical choice of {βt} in Theorem 5.1 is typically overly conservative. Hence, following the suggestion in [19], for any algorithms that use the GP-UCB acquisition, we scale βt down by a factor of 5. Finally, for the synthetic functions, ϵ is set at 0.05 whist for the machine learning models, ϵ is set at 0.02 as we require higher accuracy in these cases.

We compare our proposed method, GPUCB-UBO, with seven baselines: (1) EI-Vanilla: the vanilla Expected Improvement (EI); (2) EI-Volx2: the EI with the search space volume doubled every 3d iterations [16]; (3) EI-H: the Regularized EI with a hinge-quadratic prior mean where β = 1 and R is the circumradius of the initial search space [16]; (4) EI-Q: the Regularized EI with a quadratic prior mean where the widths w are set to those of the initial search space [16]; (5) GPUCB-Vanilla: the vanilla GP-UCB; (6) GPUCB-Volx2: the GP-UCB with the search space volume doubled every 3d iterations [16]; (7) GPUCB-FBO: the GP-UCB with the ﬁtering expansion strategy in [12].

6.1 Visualization

We visualize our theoretical expansion search spaces derived in Theorem 4.1 on the Beale test function (Figure 2). We show the contour plots of the GP-UCB acquisition functions, and show both the observations (red stars) and the recommendation from the algorithm that correspond the acquisition function maximum (cyan stars). The initial user-deﬁned search space (black rectangle) is expanded as per theoretical search spaces developed in Theorem 4.1 (yellow rectangles). Here we use Eq. (6) to plot the expansion search spaces, however, the spaces developed in Theorem 4.1 are tighter. The ﬁgure illustrates that when the argmax of the objective function is outside of the user-deﬁned search space, with our search space expansion strategy, this argmax can be located within a ﬁnite number of expansions.

Figure 2: Expansion search spaces using Theorem 4.1 for Beale function in two cases when the global ϵ-accuracy is achieved within (a) one expansion; or (b) two expansions. The black rectangle is the user-deﬁned search space and the yellow rectangles are the theoretical expansion search spaces. The contour plots of the acquisition function are also displayed with observations (red stars) and the recommendation at that iteration (cyan star). Global optimum of Beale function is the magenta star.

6.2 Synthetic Benchmarks

We compare our method with seven baselines on ﬁve benchmark test functions: Beale, Eggholder, Levy 3, Hartman 3 and Hartman 6. We use the same experiment setup as in [16]. The length of the initial user-deﬁned search space is set to be 20% of the length of the function domain - e.g. if the function domain is the unit hypercube [0, 1]d, then the initial search space has side length of 0.2. The center of this initial search space is placed randomly in the domain of the objective function.

For each test function and algorithm, we run the experiment 30 times, and each time the initial search space will be placed differently. We plot the mean and the standard error of the best found values maxi=1,n f(xi) of each test function. Figure 3 shows that for most test functions, our method GPUCB-UBO achieves both better function values and in less iterations than other methods. For most test functions, our method is better than other six state-of-the-art approaches (except GPUCB-FBO) by a high margin. Compared with GPUCB-FBO, our method is better on the test functions Hartman3 and Hartman6 while performing similar on other three test functions. Note that the computation time of GPUCB-FBO is 2-3 times slower than our method and other approaches (see Table 1) because it needs an extra step to numerically solve several optimization problems to construct the new search space. Since we derive the expansion search spaces analytically, our method, in contrast, can optimize the acquisition function within these spaces without any additional computation.

Figure 3: Best found values of various synthetic benchmark test functions using different algorithms. Plotting mean and standard error over 30 repetitions. (Best seen in color)

Table 1: The average runtime (seconds) of selecting the next input of different methods. All the time measurements were taken when evaluating the methods on a Ubuntu 18.04.2 server with Intel Xeon CPU E5-2670 2.60GHz 128GB RAM. All the source codes are written in Python 3.6.

METHODS Beale Eggholder Hartman3 Levy3 Hartman6

GPUCB-UBO 2.8 0.2 2.8 0.3 3.1 0.5 3.7 0.5 5.0 0.9 EIH 3.4 0.2 1.0 0.01 1.2 0.03 4.9 0.2 1.4 0.02 EIQ 5.6 0.4 2.9 0.02 3.3 0.03 5.8 0.3 5.7 0.1 EI-Vol2 3.2 0.2 0.9 0.01 1.2 0.1 5.1 0.2 1.7 0.1 GPUCB-Vol2 3.5 0.4 1.6 0.05 9.4 0.7 2.9 0.1 12.0 1.1 GPUCB-FBO 5.6 0.4 5.4 0.2 8.3 1.1 8.6 0.3 18.8 2.9

6.3 Hyperparameter Tuning for Machine Learning Models

Next we apply our method on hyperparameter tuning of three machine learning models on the MNIST dataset: elastic net, multilayer perceptron and convolutional neural network. With each model, the experiments are repeated 20 times and each time the initial search space will be placed differently.

Elastic Net Elastic net is a regularized regression method that utilizes the L1 and L2 regularizers. In the model, the hyperparameter α > 0 determines the magnitude of the penalty and the hyperparameter l (0 l 1) balances between the L1 and L2 regularizers. We tune α in the normal space while l is tuned in an exponent space (base 10). The initial search space of α and l is randomly placed in the domain [ 3, 1] [0, 1] with side length to be 20% of the domain size length. We implement the Elastic net model using the function SGDClassiﬁer in the scikit-learn package [13].

Multilayer Perceptron (MLP) We construct a 2-layer MLP with 512 neurons/layer. We optimize three hypeparameters: the learning rate l and the L2 norm regularization hyperparameters lr1 and lr2 of the two layers. All the hyperparameters are tuned in the exponent space (base 10). The initial search space is a randomly placed unit cube in the cube [ 6, 1]3. The model is implemented using tensorﬂow. The model is trained with the Adam optimizer in 20 epochs and the batch size is 128.

Convolutional Neural Network (CNN) We consider a CNN with two convolutional layers. The CNN architecture (e.g. the number of ﬁlters, the ﬁlter shape, etc.) is chosen as the standard architecture published on the ofﬁcial Git Hub repository of tensorﬂow 1. We optimize three hyperparameters: the learning rate l and the dropout rates rd1, rd2 in the pooling layers 1 and 2. We tune rd1, rd2 in the normal space while l is tuned in an exponent space (base 10). The initial search space of rd1, rd2, l is randomly placed in the domain [0, 1] [0, 1] [ 5, 1] with side length to be 20% of this domain size length. The network is trained with the Adam optimizer in 20 epochs and the batch size is 128.

Figure 4: Prediction accuracy of different machine learning models on MNIST dataset using different algorithms. Mean and standard error over 20 repetitions are shown. (Best seen in color)

Given a set of hyperparameters, we train the models with this hyperparameter setting using the MNIST train dataset (55000 patterns) and then test the model on the MNIST test dataset (10000 patterns). Bayesian optimization method then suggests a new hyperparameter setting based on the

1https://github.com/tensorﬂow/tensorﬂow

prediction accuracy on the test dataset. This process is conducted iteratively until the evaluation budget (10d evaluations) is depleted. We plot the prediction accuracy in Figure 4. For the Elastic net model, our method GPUCB-UBO performs similar to GPUCB-FBO while outperforming the other six approaches signiﬁcantly. For the MLP model, GPUCB-UBO performs far better than other approaches. To be speciﬁc, after only 12 iterations, it achieves a prediction accuracy of 97.8% whilst other approaches take more than 24 iterations to get to this level. For the CNN model, GPUCB-UBO also outperforms other approaches by a high margin. After 30 iterations, it can provide a CNN model with prediction accuracy of 98.7%.

7 Conclusion

We propose a novel Bayesian optimization framework when the search space is unknown. We guarantee that in iterative expansions of the search space, our method can ﬁnd a point whose function value within ϵ of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal expansion required iteratively. We demonstrate our method on both synthetic benchmark functions and machine learning hyper-parameter tuning tasks and demonstrate that our method outperforms state-of-the-art approaches.

Our source code is publicly available at https://github.com/Huong Ha12/BO_unknown_searchspace.

Acknowledgments

This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

[1] P. Auer. Using conﬁdence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397 422, 2003.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002.

[3] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A uniﬁed approach to bayesian optimization and level-set estimation. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), pages 1515 1523, USA, 2016.

[4] A.D. Bull. Convergence rates of efﬁcient global optimization algorithms. Journal of Machine Learning Research, 12:2879 2904, 2011.

[5] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.

[6] P. Hennig and C.J. Schuler. Entropy search for information-efﬁcient global optimization. Journal of Machine Learning Research, 13(1):1809 1837, 2012.

[7] J.M. Henrández-Lobato, M.W. Hoffman, and Z. Ghahramani. Predictive entropy search for efﬁcient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918 926, 2014.

[8] D.R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345 383, 2001.

[9] D.R. Jones, M. Schonlau, and W.J. Welch. Efﬁcient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455 492, December 1998.

[10] H.J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97 106, 1964.

[11] J. Mo ckus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum, volume 2 of Toward Global Optimization. Elsevier, 1978.

[12] V. Nguyen, S. Gupta, S. Rane, C. Li, and S. Venkatesh. Bayesian optimization in weakly speciﬁed search space. In 2017 IEEE International Conference on Data Mining (ICDM), pages 347 356, 2017.

[13] F. Pedregosa and G. Varoquaux et al. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825 2830, 2011.

[14] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[15] J. Scarlett. Tight regret bounds for Bayesian optimization in one dimension. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4500 4508, Stockholmsmässan, Stockholm, Sweden, 2018.

[16] B. Shahriari, A. Bouchard-Cote, and N. De Freitas. Unbounded bayesian optimization via regularization. In Proceedings of the 19th International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), volume 51, pages 1168 1176, 2016.

[17] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148 175, 2016.

[18] J. Snoek, H. Larochelle, and R.P Adams. Practical bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (NIPS), NIPS 12, pages 2951 2959, USA, 2012.

[19] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML), pages 1015 1022, 2010.

[20] Z. Wang and S. Jegelka. Max-value entropy search for efﬁcient bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3627 3635, 2017.