# Bayesian Optimistic Optimisation with Exponentially Decaying Regret

Hung Tran-The 1, Sunil Gupta 1, Santu Rana 1, Svetha Venkatesh 1

Abstract

Bayesian optimisation (BO) is a well-known, efficient algorithm for finding the global optimum of expensive, black-box functions. The current practical BO algorithms have regret bounds of order $O(\sqrt{\log N / N})$, where $N$ is the number of evaluations. This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation, which is based on partitioning the search space. We propose the BOO algorithm, a first practical approach that can achieve an exponential regret bound of order $O(N^{-\sqrt{N}})$ under the assumption that the objective function is sampled from a Gaussian process with a Matérn kernel with smoothness parameter $\nu > 4 + D/2$, where $D$ is the number of dimensions. We perform experiments on the optimisation of various synthetic functions and on machine learning hyperparameter tuning tasks and show that our algorithm outperforms baselines.

1. Introduction

We consider a global optimisation problem whose goal is to maximise $f(x)$ subject to $x \in \mathcal{X} \subset \mathbb{R}^D$, where $D$ is the number of dimensions and $f$ is an expensive black-box function that can only be evaluated point-wise. The performance of a global optimisation algorithm is typically evaluated using the simple regret, given as $r_N = \sup_{x \in \mathcal{X}} f(x) - \max_{1 \le i \le N} f(x_i)$, where $x_i$ is the $i$-th sample and $N$ is the number of function evaluations. In this paper, we consider the case where the evaluation of $f$ is noiseless.

Bayesian optimisation (BO) provides an efficient model-based solution for global optimisation. The core idea is to

1 Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia. Correspondence to: Hung Tran-The.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.
Copyright 2021 by the author(s).

transform a global optimisation problem into a sequence of auxiliary optimisation problems on a surrogate function called the acquisition function. The acquisition function is built using a model of the objective function learned from its limited observations, and it recommends the next function evaluation location. Regret analysis has been done for many existing BO algorithms, and typically the regret is sub-linear in $N$ (Srinivas et al., 2012; Russo et al., 2018). More recently, Vakili et al. (2020) have improved this under the noiseless setting. However, a limitation of BO is that performing such a sequence of auxiliary optimisation problems is expensive.

De Freitas et al. (2012) introduced a Gaussian process (GP) based scheme called δ-cover sampling as an alternative to the acquisition function to trade off exploration and exploitation. Their method samples the objective function on a finite lattice within a feasible region and doubles the density of points in the lattice at each iteration. However, even in moderate dimensions, their algorithm is impractical since the lattice quickly becomes too large to be sampled in a reasonable amount of time (as pointed out by Wang et al. (2014) and Kawaguchi et al. (2016)).

An alternative practical approach for global optimisation is tree-based optimistic optimisation (Munos, 2011; Floudas, 2005). These algorithms partition the search space into finer and finer regions by building a hierarchical tree. The key is to have efficient strategies to identify a set of nodes that may contain the global optimum and then to successively reduce and refine the search space to get closer to the optimum. For example, the DIRECT algorithm (Jones et al., 1993) partitions the search space assuming a global Lipschitz constant. The Simultaneous Optimistic Optimisation (SOO) algorithm (Munos, 2011) generalises this by using only local Lipschitz conditions.
Under certain assumptions, this algorithm shows the possibility of achieving an exponentially diminishing regret $O(e^{-\sqrt{N}})$. An additional advantage of such algorithms is that they do not need to perform an auxiliary non-convex optimisation of the acquisition function as in BO, which may be difficult in cases that are high-dimensional (Kandasamy, 2015; Tran-The et al., 2020) or have unbounded search spaces (Tran-The et al., 2020). However, these optimistic algorithms are model-free, that is, they do not utilise the function observations efficiently.

A natural extension to improve the sample efficiency is to incorporate a model of the objective function into the optimistic strategy. Works that do this include BaMSOO (Wang et al., 2014) and IMGPO (Kawaguchi et al., 2016). Using a Gaussian process (Rasmussen & Williams, 2005) as a model of the objective function, both algorithms avoid evaluating the objective function at points known to be sub-optimal with high probability. While BaMSOO has a sub-linear regret bound, IMGPO can achieve an exponential regret bound. Despite this, IMGPO has still not overcome the worst-case regret bound of order $O(e^{-\sqrt{N}})$ achieved by SOO, which does not use any model of the objective function. A natural question is: is there a practical algorithm for global optimisation that can break this $O(e^{-\sqrt{N}})$ regret order under a mild assumption? In this paper, we propose a novel approach, which combines the strengths of tree-based optimistic optimisation methods and Bayesian optimisation to achieve an improved regret bound $O(N^{-\sqrt{N}})$ in the worst case.
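To see how much faster $N^{-\sqrt{N}}$ decays than $e^{-\sqrt{N}}$, note that $N^{-\sqrt{N}} = e^{-\sqrt{N}\ln N}$, so the improvement amounts to an extra $\ln N$ factor in the exponent. A quick numerical sketch (our own illustration, not from the paper):

```python
import math

# N^{-sqrt(N)} = exp(-sqrt(N) * ln N) decays faster than exp(-sqrt(N))
# whenever ln N > 1, i.e. for all N >= 3.
for N in (10, 100, 1000):
    soo_order = math.exp(-math.sqrt(N))  # worst-case order of SOO / IMGPO
    boo_order = N ** (-math.sqrt(N))     # the order proved for BOO
    assert boo_order < soo_order
```

The gap widens quickly: already at $N = 10$ the two orders differ by several orders of magnitude.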
Our main contributions are summarised as follows:

- A GP-based optimistic optimisation algorithm using a novel partitioning procedure and a novel function sampling strategy.
- A worst-case regret bound of $O(N^{-\sqrt{N}})$ in the noiseless setting under the assumption that the objective function is sampled from a Gaussian process with a Matérn kernel with smoothness parameter $\nu > 4 + D/2$, where $N$ is the number of evaluations and $D$ is the number of dimensions. Our algorithm avoids the auxiliary optimisation step at each iteration of BO, and avoids the δ-cover sampling of De Freitas et al. (2012). To the best of our knowledge, this is the tightest regret bound for BO algorithms that do not use the impractical δ-cover sampling procedure.
- To validate our algorithm in practice, we perform experiments on the optimisation of various synthetic functions and on machine learning hyperparameter tuning tasks and show that our algorithm outperforms baselines.

2. Related Works

In this section, we briefly review related work additional to that mentioned in Section 1. In the Bayesian optimisation literature, some works use a tree structure over the search space. While Wang et al. (2018) used a Mondrian tree to partition the search space, a recent work by Wang et al. (2020) used a dynamic tree built via the K-means algorithm. However, these works focused on improving BO's empirical performance for large-scale data sets or high dimensions rather than on improving the regret bound.

There are two viewpoints on BO, Bayesian and non-Bayesian, as pointed out by Scarlett et al. (2017). In the non-Bayesian viewpoint, the function is treated as fixed and unknown, and assumed to lie in a reproducing kernel Hilbert space (RKHS). Under this viewpoint, Chowdhury & Gopalan (2017) and Janz et al. (2020) provided upper regret bounds while Scarlett et al. (2017) provided lower regret bounds for BO with Matérn kernels. These bounds are all sub-linear.
In contrast, in the Bayesian viewpoint, where the underlying function is assumed to be random according to a GP, Kawaguchi et al. (2016) showed that BO can obtain an exponential convergence rate. In this paper, we focus on the Bayesian viewpoint and break the regret bound order of IMGPO (Kawaguchi et al., 2016) under some mild assumptions.

The optimistic optimisation methods have also been extended to different problem settings, e.g., the noisy setting (Valko et al., 2013; Grill et al., 2015), high-dimensional spaces (Qian & Yu, 2016; Al-Dujaili & Suresh, 2017), multi-objective optimisation (Al-Dujaili & Suresh, 2018), and multi-fidelity black-box optimisation (Sen et al., 2018). Some works (Shilton et al., 2017; Theckel Joy et al., 2019) derived regret bounds for transfer learning in BO. Our work is complementary to these works, and integrating our solution with them may be a promising way to improve their regret bounds.

3. Preliminaries

Bayesian Optimisation. The standard BO routine consists of two key steps: estimating the black-box function from observations and maximising an acquisition function to suggest the next function evaluation point. A Gaussian process is a popular choice for the first step. Formally, we have $f(x) \sim GP(m(x), k(x, x'))$, where $m(x)$ and $k(x, x')$ are the mean and the covariance (or kernel) functions. Given a set of observations $D_{1:p} = \{x_i, y_i\}_{i=1}^p$ under a noiseless observation model $y_i = f(x_i)$, the predictive distribution can be derived as $P(f(x) \mid D_{1:p}, x) = \mathcal{N}(\mu_{p+1}(x), \sigma^2_{p+1}(x))$, where $\mu_{p+1}(x) = \mathbf{k}^T K^{-1}\mathbf{y} + m(x)$ and $\sigma^2_{p+1}(x) = k(x, x) - \mathbf{k}^T K^{-1}\mathbf{k}$. In the above expressions we define $\mathbf{k} = [k(x, x_1), \ldots, k(x, x_p)]^T$, $K = [k(x_i, x_j)]_{1 \le i,j \le p}$ and $\mathbf{y} = [y_1, \ldots, y_p]$. Well-known acquisition functions for the second step include the upper confidence bound (GP-UCB) (Srinivas et al., 2012), expected improvement (EI) (Bull, 2011), Thompson sampling (TS) (Russo et al., 2018) and predictive entropy search (PES) (Hernández-Lobato et al., 2014).
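The noiseless posterior above can be written down in a few lines. The following is a minimal sketch (our own illustration, not the paper's code), using the Matérn kernel family employed throughout the paper with a zero prior mean, $\sigma^2 = \lambda = 1$, and a small jitter added for numerical stability:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(A, B, nu=2.5, sigma2=1.0, lam=1.0):
    """Matérn covariance matrix between row-sets A and B:
    sigma^2 / (Gamma(nu) 2^{nu-1}) * (r/lam)^nu * B_nu(r/lam)."""
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    K = np.full_like(r, sigma2)          # k(0) = sigma^2 by continuity
    nz = r > 0
    s = r[nz] / lam
    K[nz] = sigma2 / (gamma(nu) * 2 ** (nu - 1)) * s ** nu * kv(nu, s)
    return K

def gp_posterior(X, y, Xs, nu=2.5):
    """Noiseless zero-mean GP posterior: mu = k^T K^{-1} y,
    var = k(x, x) - k^T K^{-1} k."""
    K = matern(X, X, nu) + 1e-10 * np.eye(len(X))   # jitter for stability
    k = matern(X, Xs, nu)
    Kinv_k = np.linalg.solve(K, k)
    mu = Kinv_k.T @ y
    var = matern(Xs, Xs, nu).diagonal() - (k * Kinv_k).sum(0)
    return mu, var

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.0, 1.0, 0.5])
mu, var = gp_posterior(X, y, X)
# the noiseless posterior interpolates: mu == y and var ~ 0 at observed points
```

Because the observation model is noiseless, the posterior mean interpolates the data exactly and the posterior variance vanishes at observed points, which is what makes the confidence bounds used later so informative.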
Among them, GP-UCB is given as $U_p(x) = \mu_p(x) + \beta_p^{1/2}\sigma_p(x)$, where $\beta_p$ is the parameter balancing exploration and exploitation. We will use GP-UCB in our tree expansion scheme to determine the node to be expanded.

In this paper, we focus on the popular class of Matérn kernels for Gaussian processes, defined as $k_\nu(x, x') = \frac{\sigma^2}{\Gamma(\nu)\,2^{\nu-1}}\left(\frac{\|x - x'\|_2}{\lambda}\right)^{\nu} B_\nu\!\left(\frac{\|x - x'\|_2}{\lambda}\right)$, where $\Gamma$ denotes the Gamma function, $B_\nu$ denotes the modified Bessel function of the second kind, $\nu$ is a parameter controlling the smoothness of the function, and $\sigma^2, \lambda$ are hyper-parameters of the kernel. We assume that the hyper-parameters are fixed and known in advance. However, our work can also be extended to unknown hyper-parameters of the Matérn kernel as in (Vakili et al., 2020) (for the Bayesian setting). Important special cases are $\nu = 1/2$, which corresponds to the exponential kernel, and $\nu \to \infty$, which corresponds to the squared exponential kernel. The Matérn kernel is of particular practical significance, since it offers a more suitable set of assumptions for the modelling and optimisation of physical quantities (Stein, 1999).

Hierarchical Partition. We use the hierarchical partition of the search space as in (Munos, 2011). Given a branch factor $m$, at any depth $h$ the search space $\mathcal{X}$ is partitioned into a set of $m^h$ sets $A_{h,i}$ (called cells), where $0 \le i \le m^h - 1$. This partitioning is represented as an $m$-ary tree in which each cell $A_{h,i}$ corresponds to a node $(h, i)$. A node $(h, i)$ has $m$ children, indexed as $\{(h+1, i_j)\}_{1 \le j \le m}$, which form a partition of the parent cell $A_{h,i}$. The root of the tree corresponds to the whole domain $\mathcal{X}$. The center of a cell $A_{h,i}$ is denoted by $c_{h,i}$; this is where $f$ and its upper confidence bound are evaluated.

4. Proposed BOO Algorithm
4.1. Motivation

Most tree-based optimistic optimisation algorithms, such as SOO, StoSOO (Valko et al., 2013), BaMSOO and IMGPO, face a strict negative correlation between the branch factor $m$ and the number of tree expansions given a fixed function evaluation budget $N$. On the one hand, using a larger $m$ makes the tree finer, which helps to reach closer to the optimum. On the other hand, having more expansions in the tree also allows finer partitions in multiple regions of the space. Thus both a larger branch factor and a larger number of tree expansions allow an algorithm to get closer to the optimum. However, each time a node is expanded, algorithms such as SOO and StoSOO spend $m$ function evaluations, one for each of the $m$ children, and thus the number of tree expansions is restricted to at most $\lfloor N/m \rfloor$. Thus when $m$ increases, the number of tree expansions decreases. We call this phenomenon the strict negative correlation of tree-based optimistic optimisation algorithms.

Figure 1. An illustration of the partitioning procedure for m = 8: (a) SOO methods partition a cell by dividing the longest side into m equal parts; (b) our method sets m = a^b and partitions a cell by dividing the b = 3 longest sides into a = 2 equal parts.

Using the assumption that the objective function is sampled from a GP prior, BaMSOO and IMGPO reduce this negative correlation by evaluating the function only at the children where the UCB value is greater than the best function value observed thus far ($f^+$). However, the number of expansions is still tied to the branch factor, lying between $N/m$ and $N$.

We present a new approach which unties the branch factor $m$ from the number of tree expansions and hence resolves the strict negative correlation of tree-based optimistic optimisation algorithms. By doing so, we can exploit a large $m$ to achieve finer partitions, and we achieve a regret bound $O(N^{-\sqrt{N}})$, improving upon current BO algorithms.
4.2. BOO Algorithm

Our algorithm is described in Algorithm 1. We assume that the objective function is a sample from a GP as in Bayesian optimisation; however, our approach follows the principle of SOO, which uses a hierarchical partitioning of the search space $\mathcal{X}$. The main differences between the proposed BOO and previous works lie in the partitioning procedure, the tree expansion mechanism and the function sampling strategy.

Partitioning Procedure. Unlike SOO-based algorithms, including BaMSOO and IMGPO, which divide a cell into $m$ children cells along the longest side of the cell, we use a novel partitioning procedure which exploits a particular decomposition of the branch factor $m$. Given any $a \ge 2$, $1 \le b \le D$ and $m = a^b$, where $a, b, m \in \mathbb{N}$, our partitioning procedure, denoted by $P(m; a, b)$, divides each of the cell's $b$ longest sides into $a$ equal parts, producing $m = a^b$ new cells (see Figure 1). When $b = 1$, our procedure reduces to the traditional partitioning procedure of SOO.

Algorithm 1 The BOO Algorithm

Input: An evaluation budget N and parameters m, a, b ∈ N.
Initialisation: Set T0 = {(0, 0)} (root node). Set p = 1. Sample initial points to build D0.
1: while True do
2:   Set vmax = −∞
3:   for h = 0 to min(depth(Tp), hmax(p)) do
4:     Among all leaves (h, j) ∈ L of depth h, select (h, i) ∈ argmax_{(h,j)∈L} Up(c_{h,j})
5:     if Up(c_{h,i}) ≥ vmax then
6:       Expand node (h, i) by adding m children (h+1, i_j) to tree Tp, using partitioning procedure P(m; a, b)
7:       Evaluate f(c_{h,i})
8:       Augment the data Dp = Dp−1 ∪ {(c_{h,i}, f(c_{h,i}))}
9:       Fit the Gaussian process using Dp
10:      Update vmax = max{f(c_{h,i}), vmax}
11:      Update p = p + 1
12:      if p = N then
13:        Return x(N) = argmax_{(c_{h,i}, f(c_{h,i})) ∈ DN} f(c_{h,i})
14:      end if
15:    end if
16:  end for
17: end while

When $b = D$, all dimensions of the cell are divided, which benefits our algorithm. We will explain this further in our convergence analysis.
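The partitioning procedure $P(m; a, b)$ can be sketched in a few lines. The following is our own minimal implementation (cells represented as lower/upper corner pairs), not the authors' code:

```python
import itertools
import numpy as np

def partition(lo, hi, a, b):
    """P(m; a, b): divide each of the b longest sides of the cell [lo, hi]
    into a equal parts, producing m = a**b children cells."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    sides = hi - lo
    dims = np.argsort(-sides)[:b]            # indices of the b longest sides
    children = []
    for combo in itertools.product(range(a), repeat=b):
        clo, chi = lo.copy(), hi.copy()
        for d, j in zip(dims, combo):
            step = sides[d] / a
            clo[d] = lo[d] + j * step
            chi[d] = clo[d] + step
        children.append((clo, chi))
    return children

# splitting the unit cube with P(8; 2, 3) yields 8 children of volume 1/8
kids = partition(np.zeros(3), np.ones(3), a=2, b=3)
```

With $b = 1$ this reduces to the SOO-style split of the single longest side; with $b = D$ every dimension of the cell shrinks at once, which is the setting used in the convergence analysis.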
Tree Expansion Mechanism. The algorithm incrementally builds a tree $T_p$ for $p = 1, \ldots, N$, starting with the root node $T_0 = \{(0, 0)\}$, where $N$ is the evaluation budget. At depth $h$, among all the leaf nodes, denoted by $L$, of the current tree, the algorithm selects the node with the maximum GP-UCB value, defined as $U_p(c) = \mu_p(c) + \beta_p^{1/2}\sigma_p(c)$, where $\beta_p^{1/2} = \sqrt{2\log(\pi^2 p^3/(3\eta))}$ and $\eta \in (0, 1)$. The tree is expanded by adding $m$ children to the selected node. To limit the depth of the tree after $p$ expansions, we use a depth function $h_{max}(p)$, which is also a parameter of the algorithm. We note that the algorithm uses the GP-UCB acquisition function to determine the candidate node, but it only performs a maximisation over a finite, discrete set comprising the leaf nodes at the depth in consideration.

Function Sampling Strategy. Once a node is selected for expansion, unlike previous works which evaluate the objective function at the children nodes, we propose to evaluate the objective function only at that node, without evaluating the function at its children (see Figure 2).

Figure 2. SOO samples the function at all m children nodes while our sampling strategy samples the function only at the parent node (the node selected for expansion). As a result, our strategy requires only one function evaluation irrespective of the value of m.

By this sampling scheme, our algorithm unties the branch factor $m$ from the number of tree expansions. As a result, it allows the use of a large $m$ to achieve finer partitions and to reach closer to the optimum. In fact, using a small $m$ as in previous optimistic optimisation methods can also reach finer partitions, but it needs a large number of tree expansions, and thus still needs a large number of evaluations. Consider an example where $m = 2^D$ with $a = 2$ and $b = D$. Using the partitioning procedure $P(m; 2, D)$, our algorithm partitions a node into $m$ cells with the same granularity.
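The node-selection rule (Line 4 of Algorithm 1) is just a discrete argmax of GP-UCB values over the leaves at one depth. A minimal sketch (our own illustration; the posterior values $\mu, \sigma$ at the leaf centers are assumed given):

```python
import numpy as np

def beta_sqrt(p, eta=0.05):
    """Exploration weight beta_p^{1/2} = sqrt(2 log(pi^2 p^3 / (3 eta)))."""
    return np.sqrt(2 * np.log(np.pi ** 2 * p ** 3 / (3 * eta)))

def select_node(mu, sigma, p, eta=0.05):
    """Line 4 of Algorithm 1: pick the leaf (at one depth) maximising
    U_p(c) = mu(c) + beta_p^{1/2} sigma(c)."""
    ucb = np.asarray(mu) + beta_sqrt(p, eta) * np.asarray(sigma)
    return int(np.argmax(ucb))

# an unexplored leaf (high sigma) can beat a slightly better-looking one
assert select_node(mu=[0.0, 0.5], sigma=[1.0, 0.1], p=5) == 0
```

No continuous inner optimisation is needed: the argmax runs over the finitely many leaf centers, which is what removes BO's expensive auxiliary optimisation step.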
To reach the same granularity as our method, the SOO algorithm can use the partitioning with $m = 2$ and repeat it $D$ times. However, in this way, SOO always spends $2^D$ function evaluations while our algorithm uses only one evaluation. The BaMSOO and IMGPO algorithms have a similar problem although they improve over SOO: they only evaluate the function at nodes $c$ for which the condition $U(c) > f^+$ (the best function value observed thus far) is satisfied. In summary, to reach the same granularity as our method, these algorithms need to spend $m$ evaluations, where $1 \le m \le 2^D$ depending on the number of nodes $c$ satisfying the condition $U(c) > f^+$. In contrast, our algorithm spends only one evaluation in all cases, regardless of the value of $m$. Together with our partitioning procedure, we leverage this benefit to improve the regret bound for optimisation.

5. Convergence Analysis

In this section, we theoretically analyse the convergence of our algorithm. We start with assumptions on the function $f$.

5.1. Assumptions

To guarantee the correctness of our algorithm, we use the following assumptions.

Assumption 1. The function $f$ is sampled from $GP(0, k_\nu)$, that is, a zero-mean GP with a Matérn kernel $k_\nu$ with $\nu > 4 + D/2$, where $D$ is the number of dimensions.

Assumption 2. The objective function $f$ has a unique global maximum $x^*$.

The assumption of a unique maximiser holds with probability one in most non-trivial cases (De Freitas et al., 2012). Under these assumptions, we obtain the following property.

Property 1. Assume that the function $f$ is sampled from $GP(0, k_\nu)$ satisfying Assumptions 1 and 2. Then,

1. $f(x^*) - f(x) \le L_1\|x^* - x\|_2^2$ for every $x \in \mathcal{X}$, for some constant $L_1 > 0$;
2. $f(x^*) - f(x) \ge L_2\|x - x^*\|_2^2$ for every $x \in B(x^*, \theta)$, for some constants $L_2, \theta > 0$;
3. $f(x^*) - \max_{x \in \mathcal{X} \setminus B(x^*, \theta)} f(x) > \epsilon_0$ for some $\epsilon_0 > 0$.

We note that the constants $L_1$, $L_2$ and $\epsilon_0$ are all unknown.
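As a toy illustration of Property 1 (our own example, not from the paper), the function $f(x) = -\|x - x^*\|_2^2$ on $\mathcal{X} = [0, 1]^2$ satisfies Properties 1.1 and 1.2 with $L_1 = L_2 = 1$ and any $\theta > 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = np.array([0.3, 0.7])
f = lambda x: -np.sum((x - x_star) ** 2)

for _ in range(1000):
    x = rng.random(2)                 # x in X = [0, 1]^2
    gap = f(x_star) - f(x)            # equals ||x - x_star||^2 here
    d2 = np.sum((x - x_star) ** 2)
    assert gap <= 1.0 * d2 + 1e-12    # Property 1.1 with L1 = 1
    assert gap >= 1.0 * d2 - 1e-12    # Property 1.2 with L2 = 1
```

For a generic GP sample the two constants differ and are unknown; the point of the analysis is that the algorithm never needs them.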
Property 1.2 can be viewed as quadratic behaviour of the objective function in the neighbourhood of the optimum. This property holds for every Matérn kernel with $\nu > 2$, as argued by De Freitas et al. (2012) and Wang et al. (2014). The result closest to ours is that of IMGPO (Kawaguchi et al., 2016), which achieves an exponential regret bound $e^{-\sqrt{N}}$. However, their work requires a type of quadratic behaviour over the whole search space (Assumption 2 in their paper), which is quite strong. Compared to it, our assumption is weaker, requiring the quadratic behaviour of the function only in a neighbourhood of the optimum. Despite this, we will show that our algorithm improves their regret bound.

5.2. Convergence Analysis

For the theoretical guarantee, we follow the principle of optimism in the face of uncertainty as in (Munos, 2011). The basic idea is to construct the set of expandable nodes at each depth $h$, called the expansion set; we do this in Section 5.2.1. Quantifying the size of the expansion set is a key step of this principle; we do this in Section 5.2.2. Finally, using upper bounds on the sizes of these sets, we derive the regret bounds in Section 5.2.3. All proofs are provided in the Supplementary Material.

5.2.1. The Expansion Set

Definition 1. Let the expansion set at depth $h$ be the set of all nodes that could potentially be expanded before the optimal node at depth $h$ is selected for expansion in Algorithm 1. Formally, $I_h = \{(h, i) \mid \exists p \le N : U_p(c_{h,i}) \ge f(x^*) - \delta(h; a, b)\}$, where $\delta(h; a, b) = L_1 D a^{-2\lfloor bh/D \rfloor}$ and $U_p(c_{h,i})$ is the upper confidence bound at the center $c_{h,i}$ of node $(h, i)$ after $p$ expansions.

We note that even though this definition uses $\delta(h; a, b)$, which depends on the unknown constant $L_1$, our BOO algorithm does not need to know this information. The reason we use $\delta(h; a, b) = L_1 D a^{-2\lfloor bh/D \rfloor}$ lies in the following three observations about our partitioning procedure $P(m; a, b)$ in the search space $\mathcal{X}$.
We assume here that $\mathcal{X} = [0, 1]^D$ (this can always be achieved by scaling).

Lemma 1. Given a cell $A_{h,i}$ at depth $h$, we have that
1. the longest side of cell $A_{h,i}$ is at most $a^{-\lfloor bh/D \rfloor}$;
2. the smallest side of cell $A_{h,i}$ is at least $a^{-\lceil bh/D \rceil}$.

Lemma 2. Given a cell $A_{h,i}$ at depth $h$ with center $c_{h,i}$, we have $\sup_{x \in A_{h,i}} \|x - c_{h,i}\| \le D^{1/2} a^{-\lfloor bh/D \rfloor}$.

Proof. By Lemma 1, the longest side of a cell at depth $h$ is at most $a^{-\lfloor bh/D \rfloor}$. Therefore, $\sup_{x \in A_{h,i}} \|x - c_{h,i}\| \le \sqrt{D}\, a^{-\lfloor bh/D \rfloor} = D^{1/2} a^{-\lfloor bh/D \rfloor}$.

We denote by $(h, i^*)$ the optimal node at depth $h$, i.e., the node whose cell $A_{h,i^*}$ contains $x^*$.

Lemma 3. At any depth $h$, we have $f(c_{h,i^*}) \ge f(x^*) - \delta(h; a, b)$.

Proof. By Property 1, $f(x^*) - f(c_{h,i^*}) \le L_1\|x^* - c_{h,i^*}\|^2$. By Lemma 2, $\|x^* - c_{h,i^*}\| \le D^{1/2} a^{-\lfloor bh/D \rfloor} = \sqrt{\delta(h; a, b)/L_1}$. Thus, $f(x^*) - f(c_{h,i^*}) \le L_1\|x^* - c_{h,i^*}\|^2 \le \delta(h; a, b)$.

By Lemma 3 and the fact that $U_p(c_{h,i^*}) \ge f(c_{h,i^*})$ with high probability, we have that $U_p(c_{h,i^*}) \ge f(x^*) - \delta(h; a, b)$ with high probability. It follows that the global maximum $x^*$ belongs to the expansion set at every depth with high probability.

The expansion set at a depth $h$ in our approach differs from those in the works of (Munos, 2011; Wang et al., 2014; Kawaguchi et al., 2016), which are defined as $I'_h = \{(h, i) \mid f(c_{h,i}) \ge f(x^*) - \delta(h; a, b)\}$. More precisely, the set $I'_h$ of their works is smaller than the set $I_h$ defined above, because $U_p(c_{h,i}) \ge f(c_{h,i})$ with high probability. This bigger $I_h$ might involve unnecessary explorations, and therefore the algorithm might incur a higher regret than those of BaMSOO and IMGPO. However, we address this challenge by leveraging our new partitioning procedure $P(m; a, b)$ with $b = D$, together with some results from (Kanagawa et al., 2018; Vakili et al., 2020), which are presented in the following section.

5.2.2. An Upper Bound on the Size of the Expansion Set

To quantify $|I_h|$, we use the concept of near-optimality dimension as in (Munos, 2011).
In our context, we define the near-optimality dimension as follows:

Definition 2. The near-optimality dimension is the smallest $d \ge 0$ such that there exists $C > 0$ such that, for any $\delta(h; a, b)$, the maximal number of disjoint cells at depth $h$ with center in $\mathcal{X}_{\delta(h;a,b)}$ is less than $C(\delta(h; a, b))^{-d}$, where $\mathcal{X}_{\delta(h;a,b)} = \{x \in \mathcal{X} \mid U_p(x) \ge f(x^*) - \delta(h; a, b)\}$.

With the partitioning procedure $P(m; a, b)$, where $a = O(N^{1/D})$ and $b = D$, we will show that $d = 0$ (which is equivalent to proving $|I_h| \le C$) through the following Theorem 1.

Theorem 1 (Bound on Expansion Set Size). Consider a partitioning procedure $P(m; a, b)$ where $a = O(N^{1/D})$ and $b = D$. Then there exist constants $N_1 > 0$ and $C > 0$ such that for every $N \ge N_1$ and $h \ge 2$, we have $|I_h| \le C$.

To prove Theorem 1, we bound the variance function $\sigma_p$ in terms of $\delta(h - 1; a, b)$ through the following lemma.

Lemma 4. Assume that node $(h, i)$ at depth $h \ge 1$ is evaluated at the $p$-th evaluation, where $p \ge h$. Then $\sigma_p(c_{h,i}) \le C_1(\delta(h - 1; a, b))^{\nu/2 - D/4}$, where $C_1$ is a constant.

This lemma holds by applying results on the closeness between samples from a GP (in the Bayesian setting) and elements of an RKHS (in the non-Bayesian setting), as in (Kanagawa et al., 2018; Vakili et al., 2020), to the structured search space of our approach. This technique is novel compared to those of BaMSOO and IMGPO. We refer to Section 3 of our Supplementary Material. Using this result and the condition $\nu > 4 + D/2$, we achieve a constant bound on the size of the set $I_h$.

5.2.3. Bounding the Simple Regret

Next we use the upper bound on $|I_h|$ at every depth $h$ to derive a bound on the simple regret $r_N$. Let $h^*_p$ denote the depth of the deepest expanded node in the branch containing $x^*$ after $p$ expansions. Similar to Lemma 2 in (Munos, 2011), we can bound the sum of the $|I_h|$ as follows.

Lemma 5. Assume that $f(c) \le U_p(c)$ for all centers $c$ of optimal nodes at all depths $0 \le h \le h_{max}(p)$ after $p$ expansions.
Then for any depth $0 \le h \le h_{max}(p)$, whenever $p \ge h_{max}(p)\sum_{i=0}^{h}|I_i|$, we have $h^*_p \ge h$.

We use $A_p$ to denote the set of all points evaluated by the algorithm together with all centers of optimal nodes of the tree $T_p$ after $p$ evaluations.

Lemma 6. Pick $\eta \in (0, 1)$. Set $\beta_p = 2\log(\pi^2 p^3/(3\eta))$ and $L_p(c) = \mu_p(c) - \beta_p^{1/2}\sigma_p(c)$. With probability $1 - \eta$, we have $L_p(c) \le f(c) \le U_p(c)$ for every $p \ge 1$ and for every $c \in A_p$.

We now use Lemmas 3-6 to derive a simple regret bound for the proposed algorithm. Here, the simple regret $r_p$ after $p$ expansions is defined as $r_p = f(x^*) - \max_{1 \le i \le p} f(x_i)$, where $x_i$ is the $i$-th sample.

Theorem 2 (Regret Bound). Assume a partitioning procedure $P(m; a, b)$ with $a = O(N^{1/D})$, $b = D$ and $2 \le m < \sqrt{N} - 1$. Let the depth function be $h_{max}(p) = \sqrt{p}$. Consider $m^2 < p \le N$, and define $h(p)$ as the largest integer $h$ such that $h \le \frac{\sqrt{p} - m - 1}{C} + 2$, where $C$ is the constant defined in Theorem 1. Pick $\eta \in (0, 1)$. Then for every $N \ge N_1$, the loss is bounded as
$r_p \le \delta(\min\{h(p), \sqrt{p} + 1\}; a, b) + 4C_1\beta_p^{1/2}\left(\delta(\min\{h(p) - 1, \sqrt{p}\}; a, b)\right)^{\nu/2 - D/4}$
with probability $1 - \eta$, where $N_1$ is the constant defined in Theorem 1, $C_1$ is the constant defined in Lemma 4 and $\beta_p = 2\log(\pi^2 p^3/(3\eta))$.

Proof. By Theorem 1, the definition of $h(p)$ and the facts that $|I_0| = 1$ and $|I_1| \le m$, we have
$\sum_{l=0}^{h(p)-1}|I_l| = |I_0| + |I_1| + (|I_2| + \ldots + |I_{h(p)-1}|) \le 1 + m + C(h(p) - 2) \le \sqrt{p}.$
Therefore, $p \ge h_{max}(p)\sum_{l=0}^{h(p)-1}|I_l|$. By Lemma 5, when $h(p) - 1 \le h_{max}(p) = \sqrt{p}$, we have $h^*_p \ge h(p) - 1$. If $h(p) - 1 > \sqrt{p}$, then $h^*_p = h_{max}(p) = \sqrt{p}$, since the BOO algorithm does not expand nodes beyond depth $h_{max}(p)$. Thus, in all cases, $h^*_p \ge \min\{h(p) - 1, \sqrt{p}\}$. Let $(h, j)$ be the deepest node in $T_p$ that has been expanded by the algorithm up to $p$ expansions; thus $h \ge h^*_p$. By Algorithm 1, we only expand a node when its GP-UCB value is larger than $v_{max}$, which is updated at Line 10 of Algorithm 1.
Thus, since the node $(h, j)$ has been expanded, its GP-UCB value is at least as high as that of some node $(h^*_p + 1, o)$ at depth $h^*_p + 1$ such that (1) node $(h^*_p + 1, o)$ has been evaluated at some $p'$-th expansion before node $(h, j)$, and (2) $(h^*_p + 1, o) \in \arg\max_{(h^*_p+1, i) \in L} U_{p'}(c_{h^*_p+1, i})$ (see Line 4 of Algorithm 1). Thus, by using Lemma 3 and Lemma 6, with probability $1 - \eta$ we have
$f(x^*) \le U_p(c_{h,j}) + \delta(h^*_p + 1; a, b) + 2\beta_{p'}^{1/2}\sigma_{p'}(c_{h^*_p+1, o}).$
Further, by Lemma 6, we have
$U_p(c_{h,j}) = \mu_p(c_{h,j}) + \beta_p^{1/2}\sigma_p(c_{h,j}) = L_p(c_{h,j}) + 2\beta_p^{1/2}\sigma_p(c_{h,j}) \le f(c_{h,j}) + 2\beta_p^{1/2}\sigma_p(c_{h,j})$
with probability $1 - \eta$. Combining these two results, we have
$f(x^*) - f(c_{h,j}) \le \delta(h^*_p + 1; a, b) + 2\beta_{p'}^{1/2}\sigma_{p'}(c_{h^*_p+1, o}) + 2\beta_p^{1/2}\sigma_p(c_{h,j})$
with probability $1 - \eta$. Finally, by using Lemma 4 to bound $\sigma_{p'}(c_{h^*_p+1, o})$ and $\sigma_p(c_{h,j})$, and using the fact that $\delta(\cdot; a, b)$ decreases with depth, we achieve
$r_p \le f(x^*) - f(c_{h,j}) \le \delta(\min\{h(p), \sqrt{p} + 1\}; a, b) + 4C_1\beta_p^{1/2}\left(\delta(\min\{h(p) - 1, \sqrt{p}\}; a, b)\right)^{\nu/2 - D/4}$
with probability $1 - \eta$. We provide the complete proof in the Supplementary Material.

Finally, we present an improved and simpler expression for the regret bound through the following corollary of Theorem 2.

Corollary 1. Pick $\eta \in (0, 1)$. Then there exists a constant $N_2 > 0$ such that for every $N \ge N_2$, the simple regret of the proposed BOO algorithm with the partitioning procedure $P(m; a, b)$, where $a = (N/2)^{1/D}$ and $b = D$, is bounded as $r_N \le O(N^{-\sqrt{N}})$ with probability $1 - \eta$, where $N$ is the number of sampled points.

Remark 1. A detailed expression for the regret bound of Corollary 1, making explicit the dependence on the constants $C_1$ (defined in Lemma 4, which depends on $L_1$ and $D$) and $C$ (defined in Theorem 1, which depends on $L_1$, $L_2$, $\eta$ and $\nu$), is extracted from the proof of Corollary 1 in the Supplementary Material.

Remark 2. The closest result to ours is the regret bound of IMGPO, which has worst-case order $O(e^{-\sqrt{N}})$.
As can be seen, we have improved this regret bound. Our result improves over previous works because we leverage a large value of the branch factor $m$ together with our new partitioning procedure with $b = D$, in which all dimensions of a cell are divided.

6. Experiments

To evaluate the performance of our BOO algorithm, we performed a set of experiments involving the optimisation of three benchmark functions and three real applications. We compared our method against five baselines which have theoretical guarantees: (1) GP-EI (Bull, 2011), (2) GP-UCB (Srinivas et al., 2012), (3) SOO (Munos, 2011), (4) BaMSOO (Wang et al., 2014), and (5) IMGPO (Kawaguchi et al., 2016).

Experimental settings. All implementations are in Python. For each test function, we repeat the experiments 15 times. We plot the mean and a confidence bound of one standard deviation across all runs. We used a Matérn kernel with $\nu = 4 + (D+1)/2$, which satisfies our assumptions, and estimated the kernel hyper-parameters automatically from data using Maximum Likelihood Estimation. All methods using a GP (GP-EI, GP-UCB, BaMSOO, IMGPO and our method) were started from randomly initialised points to train the GP. For GP-EI and GP-UCB, which follow standard BO, we used the DIRECT algorithm to maximise the acquisition functions and computed $\beta_t$ for GP-UCB as suggested in (Srinivas et al., 2012). For the tree-based space partitioning methods, we follow their implementations to set the branch factor $m$. Note that these methods use a small $m$ due to the negative correlation: SOO and BaMSOO use $m = 2$ while IMGPO uses $m = 3$. The depth function $h_{max}(p)$ in SOO and BaMSOO was set to $\sqrt{p}$, as suggested in (Munos, 2011; Wang et al., 2014). The parameter $\Xi_n$ in IMGPO was set to 1.

Table 1. Average CPU time (in seconds) for the experiment with each test function.
| Algorithm | Hartmann | Shekel  | Schwefel |
|-----------|----------|---------|----------|
| GP-EI     | 200.39   | 740.20  | 250.79   |
| GP-UCB    | 880.43   | 1640.87 | 180.96   |
| SOO       | 0.51     | 0.40    | 0.11     |
| BaMSOO    | 39.02    | 87.62   | 28.67    |
| IMGPO     | 23.23    | 80.53   | 34.65    |
| BOO       | 27.21    | 91.22   | 41.01    |

6.1. Optimisation of Benchmark Functions

We first demonstrate the efficiency of our algorithm on standard benchmark functions: Hartmann3 ($D = 3$), Schwefel ($D = 3$) and Shekel ($D = 4$). The evaluation metric is the log distance to the true optimum: $\log_{10}(f(x^*) - f^+)$, where $f^+$ is the best function value sampled so far. For our BOO algorithm, we choose the parameters $m$, $a$, $b$ and $N$ as per Corollary 1, which suggests choosing $N$ so that $a = O((N/2)^{1/D}) \ge 2$. For Hartmann3 ($D = 3$) and Schwefel ($D = 3$) we use the partitioning procedure $P(8; 2, 3)$ with $N = 200$. For the Shekel function ($D = 4$), we use $P(16; 2, 4)$ with $N = 800$ so that $(N/2)^{1/D} \ge 2$. We follow Lemma 6 in our theoretical analysis to set $\beta_p = 2\log(\pi^2 p^3/(3\eta))$, with $\eta = 0.05$.

Figure 3. Comparison of methods for Hartmann3 (D = 3), Schwefel (D = 3), and Shekel (D = 4) functions.

Figure 4. Log prediction error on the MNIST dataset for different algorithms: Elastic Net, MLP and XGBoost.

Figure 3 shows the performance of our algorithm compared to the baselines. Our method outperforms all baselines on all considered synthetic functions, with the single exception of the Shekel function, where GP-UCB performs better than our method.
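The parameter choices above can be checked directly: per Corollary 1, with $a = 2$ and $b = D$ the budget $N$ should satisfy $(N/2)^{1/D} \ge 2$. A small sanity check we added (not part of the paper's code):

```python
def budget_ok(N, D, a=2):
    """Check the condition (N/2)**(1/D) >= a suggested by Corollary 1."""
    return (N / 2) ** (1 / D) >= a

assert budget_ok(200, 3)   # Hartmann3 / Schwefel: P(8; 2, 3), N = 200
assert budget_ok(800, 4)   # Shekel: P(16; 2, 4), N = 800
```

Equivalently, the condition reads $N \ge 2a^D$, so higher-dimensional problems require a larger budget for the same $a$, which is why Shekel uses $N = 800$.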
Compared to BaMSOO and IMGPO, which are tree-based optimisation algorithms, BOO gains its efficiency from a large m and from sampling strategies similar to BO (as shown in Section 4.2). Compared to GP-EI and GP-UCB, our algorithm benefits from the way it searches for the next point to evaluate: BOO searches only within a promising region (Lines 4 and 5 of Algorithm 1) rather than over the whole search space. Moreover, unlike GP-EI and GP-UCB, BOO avoids a global acquisition-function optimisation at each iteration, which cannot be performed sufficiently accurately under a limited computation budget.

On Computational Effectiveness  Our method performs competitively against BaMSOO and IMGPO in terms of computational effectiveness (see Table 1). Because our method uses a large value of m, it takes slightly more time to compute the UCBs of all nodes. It is slower than IMGPO but much faster than GP-EI and GP-UCB, which require maximising the acquisition function over a continuous space.

On the Ablation Study between Function Sampling and Partitioning Procedure  To show the influence of the proposed function sampling and the proposed partitioning procedure on BOO's performance, we performed additional experiments with m = 64 (see Figure 5). The left plot compares different partitioning procedures while the function sampling is fixed to our proposed scheme; we see a clear improvement for b = D over b = 1. The right plot compares BaMSOO with a variant of our method that uses the proposed function sampling scheme but keeps BaMSOO's partitioning procedure (b = 1, m = 64). In this case, we are unable to outperform BaMSOO. However, our result for b = D in the left plot is significantly better than that of BaMSOO. This clearly shows that the partitioning procedure has a larger effect than the function sampling.
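To make the b = D case concrete, the following minimal sketch (our own illustration, not the paper's implementation) halves a cell along every dimension at once, so a D-dimensional box produces m = 2^D children, matching P(8; 2, 3) for D = 3:

```python
import itertools
import numpy as np

def split_all_dims(lo, hi):
    """Split the axis-aligned box [lo, hi] into 2^D equal children
    by halving every dimension simultaneously (the b = D case)."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    mid = (lo + hi) / 2.0
    children = []
    # Each binary "corner" vector selects the lower or upper half per dimension.
    for corner in itertools.product([0, 1], repeat=len(lo)):
        mask = np.array(corner) == 0
        c_lo = np.where(mask, lo, mid)
        c_hi = np.where(mask, mid, hi)
        children.append((c_lo, c_hi))
    return children

# For D = 3 this yields m = 2^3 = 8 children, as in P(8; 2, 3).
cells = split_all_dims([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

By contrast, a b = 1 scheme (as in SOO/BaMSOO) would split only one dimension per refinement, so a cell shrinks much more slowly along the remaining D − 1 axes.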
Figure 5. Ablation study of function sampling and partitioning procedure on the Hartmann6 (D = 6) function (log distance to the optimum versus iterations, two panels).

6.2. Hyperparameter Tuning for Machine Learning Models

To further validate the performance of our algorithm, we tune the hyperparameters of three machine learning models on the MNIST and Skin Segmentation datasets, and plot the log prediction error.

Elastic Net  Elastic Net is a regression method with L1 and L2 regularisation. We tune w1 and w2, where w1 > 0 expresses the magnitude of the regularisation penalty and w2 ∈ [0, 1] expresses the ratio between the two penalties. We tune w2 in the normal space, while w1 is tuned in an exponent space (base 10). The search space is the domain [0, 1] × [−3, 1]. We implement the Elastic Net model using the SGDClassifier class in the scikit-learn package.

Multilayer Perceptron (MLP)  We consider a 2-layer MLP with 512 neurons per layer and optimise three hyperparameters: the learning rate l and the L2-norm regularisation parameters lr1 and lr2 of the two layers, all tuned in an exponent space (base 10). The search space is [−6, 1]^3. The model is trained with the Adam optimizer for 20 epochs with batch size 128.

Using the MNIST dataset, we train the models with a given hyperparameter setting on the 55,000 training patterns and then test on the 10,000 test patterns. The algorithm suggests a new hyperparameter setting based on the prediction accuracy on the test set. We set N = 200, and use P(4; 2, 2) for Elastic Net and P(8; 2, 3) for MLP, as per Corollary 1. As seen in Figure 4, for Elastic Net, our algorithm outperforms all the baselines.
For MLP, our algorithm achieves only slightly lower prediction errors than the baselines, as there is little room for improvement: the prediction error of our method for MLP reaches 1.8%.

Table 2. Hyperparameters for XGBoost.
  Variable        Min   Max
  learning rate   0.1   1
  max depth       5     15
  subsample       0.5   1
  colsample       0.1   1
  gamma           0     10

XGBoost classification  We demonstrate a classification task using XGBoost (Chen & Guestrin, 2016) on the Skin Segmentation dataset (https://archive.ics.uci.edu/ml/datasets/skin+segmentation). The dataset is split into 15% for training and 85% for testing. The five XGBoost hyperparameters are summarized in Table 2. Our proposed BOO performs best, outperforming all the baselines by a wide margin.

7. Conclusion

We have presented the first practical algorithm that achieves an exponential regret bound, of order O(N^{−√N}), for Bayesian optimisation under the assumption that the objective function is sampled from a Gaussian process with a Matérn kernel with smoothness parameter ν > 4 + D/2. Our partitioning procedure and sampling strategy differ from existing ones. We have demonstrated the benefits of our algorithm in both synthetic and real-world experiments. In the future, we plan to extend our work to high dimensions and the noisy setting.

Acknowledgments

This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References

Al-Dujaili, A. and Suresh, S. Embedded bandits for large-scale black-box optimization. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 758-764, 2017.

Al-Dujaili, A. and Suresh, S. Multi-objective simultaneous optimistic optimization. Inf. Sci., 424:159-174, 2018. doi: 10.1016/j.ins.2017.09.066. URL https://doi.org/10.1016/j.ins.2017.09.066.

Bull, A. D.
Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res., 12:2879-2904, November 2011. ISSN 1532-4435.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16, pp. 785-794, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939785. URL https://doi.org/10.1145/2939672.2939785.

Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 844-853, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/chowdhury17a.html.

De Freitas, N., Smola, A. J., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pp. 955-962, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.

Floudas, C. A. Deterministic Global Optimization: Theory, Methods and Applications (Nonconvex Optimization and Its Applications, Volume 37). Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0792360141.

Grill, J.-B., Valko, M., and Munos, R. Black-box optimization of noisy functions with unknown smoothness. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 667-675. Curran Associates, Inc., 2015.

Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q.
(eds.), Advances in Neural Information Processing Systems 27, pp. 918-926. Curran Associates, Inc., 2014.

Janz, D., Burt, D., and González, J. Bandit optimisation of functions in the Matérn kernel RKHS. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2486-2495. PMLR, 26-28 Aug 2020.

Jones, D. R., Perttunen, C. D., and Stuckman, B. E. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl., 79(1):157-181, October 1993. ISSN 0022-3239.

Kanagawa, M., Hennig, P., Sejdinovic, D., and Sriperumbudur, B. K. Gaussian processes and kernel methods: A review on connections and equivalences, 2018.

Kandasamy, K., Schneider, J., and Póczos, B. High dimensional Bayesian optimisation and bandits via additive models. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pp. 295-304. JMLR.org, 2015.

Kawaguchi, K., Kaelbling, L. P., and Lozano-Pérez, T. Bayesian optimization with exponential convergence, 2016. URL https://arxiv.org/abs/1604.01348.

Munos, R. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 783-791. Curran Associates, Inc., 2011.

Qian, H. and Yu, Y. Scaling simultaneous optimistic optimization for high-dimensional non-convex functions with low effective dimensions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 2000-2006, 2016.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.

Russo, D., Roy, B. V., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling.
Foundations and Trends in Machine Learning, 11(1):1-96, 2018. doi: 10.1561/2200000070.

Scarlett, J., Bogunovic, I., and Cevher, V. Lower bounds on regret for noisy Gaussian process bandit optimization. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 1723-1742, Amsterdam, Netherlands, 07-10 Jul 2017. PMLR. URL http://proceedings.mlr.press/v65/scarlett17a.html.

Sen, R., Kandasamy, K., and Shakkottai, S. Multi-fidelity black-box optimization with hierarchical partitions. Volume 80 of Proceedings of Machine Learning Research, pp. 4538-4547, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

Shilton, A., Gupta, S., Rana, S., and Venkatesh, S. Regret bounds for transfer learning in Bayesian optimisation. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 307-315, Fort Lauderdale, FL, USA, 20-22 Apr 2017. PMLR.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theor., 58(5):3250-3265, May 2012. ISSN 0018-9448. doi: 10.1109/TIT.2011.2182033. URL http://dx.doi.org/10.1109/TIT.2011.2182033.

Stein, M. L. Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer-Verlag, New York, 1999. ISBN 0-387-98629-4. doi: 10.1007/978-1-4612-1494-6. URL http://dx.doi.org/10.1007/978-1-4612-1494-6.

Theckel Joy, T., Rana, S., Gupta, S., and Venkatesh, S. A flexible transfer learning framework for Bayesian optimization with convergence guarantee. Expert Systems with Applications, 115:656-672, 2019. ISSN 0957-4174. doi: 10.1016/j.eswa.2018.08.023. URL https://www.sciencedirect.com/science/article/pii/S0957417418305311.

Tran-The, H., Gupta, S., Rana, S., Ha, H., and Venkatesh, S.
Sub-linear regret bounds for Bayesian optimisation in unknown search spaces. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 16271-16281. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/bb073f2855d769be5bf191f6378f7150-Paper.pdf.

Tran-The, H., Gupta, S., Rana, S., and Venkatesh, S. Trading convergence rate with computational budget in high dimensional Bayesian optimization. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 2425-2432, 2020.

Vakili, S., Picheny, V., and Durrande, N. Regret bounds for noise-free Bayesian optimization, 2020.

Valko, M., Carpentier, A., and Munos, R. Stochastic simultaneous optimistic optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 19-27, 2013.

Wang, L., Fonseca, R., and Tian, Y. Learning search space partition for black-box optimization using Monte Carlo tree search, 2020.

Wang, Z., Shakibi, B., Jin, L., and de Freitas, N. Bayesian multi-scale optimistic optimization. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pp. 1005-1014, 2014.

Wang, Z., Gehring, C., Kohli, P., and Jegelka, S. Batched large-scale Bayesian optimization in high-dimensional spaces. Volume 84 of Proceedings of Machine Learning Research, pp. 745-754, Playa Blanca, Lanzarote, Canary Islands, 09-11 Apr 2018. PMLR.