# Global Optimization with Parametric Function Approximation

Chong Liu¹, Yu-Xiang Wang¹

¹Department of Computer Science, University of California, Santa Barbara, CA 93106, USA. Correspondence to: Chong Liu, Yu-Xiang Wang.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

## Abstract

We consider the problem of global optimization with noisy zeroth-order oracles, a well-motivated problem useful for various applications ranging from hyper-parameter tuning for deep learning to new material design. Existing work relies on Gaussian processes or other nonparametric families, which suffer from the curse of dimensionality. In this paper, we propose a new algorithm, GO-UCB, that leverages a parametric family of functions (e.g., neural networks) instead. Under a realizability assumption and a few other mild geometric conditions, we show that GO-UCB achieves a cumulative regret of $\tilde{O}(\sqrt{T})$ where $T$ is the time horizon. At the core of GO-UCB is a carefully designed uncertainty set over parameters based on gradients that allows optimistic exploration. Synthetic and real-world experiments illustrate that GO-UCB works better than popular Bayesian optimization approaches, even if the model is misspecified.

## 1. Introduction

We consider the problem of finding a global optimal solution to the following optimization problem:

$$\max_{x \in \mathcal{X}} f(x),$$

where $f: \mathcal{X} \rightarrow \mathbb{R}$ is an unknown non-convex function that is not necessarily differentiable in $x$. This problem is well-motivated by many real-world applications. For example, the accuracy of a trained neural network on a validation set is a complex non-convex function of a set of hyper-parameters (e.g., learning rate, momentum, weight decay, dropout, depth, width, choice of activation functions, ...) that one needs to maximize (Kandasamy et al., 2020). Also, in material design, researchers want to synthesize ceramic materials, e.g., titanium dioxide (TiO2) thin films, using microwave radiation (Nakamura et al., 2017), where the film property is a non-convex function of parameters including temperature, solution concentration, pressure, and processing time. Efficiently solving such non-convex optimization problems could significantly reduce energy cost.

We assume access to only noisy function evaluations: at round $t$, we select a point $x_t$ and receive a noisy function value $y_t$,

$$y_t = f(x_t) + \eta_t, \tag{1}$$

where $\eta_t$ for $t = 1, ..., T$ are independent, zero-mean, $\sigma$-sub-Gaussian noise variables. This is known as the noisy zeroth-order oracle setting in the optimization literature. Let $f^*$ be the optimal function value. Following the tradition of Bayesian optimization (see, e.g., Frazier (2018) for a review), throughout this paper we use cumulative regret as the evaluation criterion, defined as

$$R_T = \sum_{t=1}^{T} f^* - f(x_t),$$

where $r_t = f^* - f(x_t)$ is called the instantaneous regret at round $t$. An algorithm $\mathcal{A}$ is said to be a no-regret algorithm if $\lim_{T \rightarrow \infty} R_T(\mathcal{A})/T = 0$.

Generally speaking, solving a global non-convex optimization problem is NP-hard (Jain et al., 2017), and we need additional assumptions to proceed efficiently.
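To fix ideas, here is a minimal Python sketch of this interaction protocol. The quadratic objective and Gaussian noise are hypothetical stand-ins purely for illustration; the algorithm's actual query rule replaces the random one below.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                              # noise scale (Gaussian is sub-Gaussian)
f = lambda x: -np.sum((x - 0.5) ** 2)    # hypothetical black-box objective
f_star = 0.0                             # its optimum, for regret bookkeeping

def oracle(x):
    """Noisy zeroth-order oracle: y_t = f(x_t) + eta_t, as in eq. (1)."""
    return f(x) + sigma * rng.standard_normal()

# An algorithm interacts for T rounds; cumulative regret R_T = sum_t f* - f(x_t).
T, R_T = 100, 0.0
for t in range(T):
    x_t = rng.uniform(0, 1, size=5)      # placeholder query rule
    y_t = oracle(x_t)                    # noisy observation
    R_T += f_star - f(x_t)               # regret uses the noiseless value
print(f"cumulative regret after {T} rounds: {R_T:.2f}")
```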
Bayesian optimization usually assumes the objective function $f$ is drawn from a Gaussian process prior. Srinivas et al. (2010) proposed the GP-UCB approach, which iteratively queries the argmax of an upper confidence bound of the current posterior belief, then updates the posterior belief with the new data point. However, Gaussian processes rely on kernels, e.g., the squared exponential kernel or the Matérn kernel, which suffer from the curse of dimensionality: a folklore rule of thumb is that GP-UCB becomes unwieldy when the dimension is larger than 10.

A naive alternative is to passively query $T$ data points uniformly at random, estimate $f$ by $\hat{f}$ using supervised learning, and then return the maximizer of the plug-in estimator $\hat{x} = \arg\max_{x \in \mathcal{X}} \hat{f}(x)$. This may side-step the curse of dimensionality, depending on which supervised learning model we use. The drawback of this passive query model is that it neither exploits the structure of the function nor quickly zooms in to the nearly optimal region of the space. In contrast, an active query model allows the algorithm to iteratively interact with the function: at round $t$, the model collects information from all previous rounds $1, ..., t-1$ and decides where to query next.

**GO-UCB Algorithm.** In this paper, we develop an algorithm that allows Bayesian optimization-style active queries to work with general supervised learning-based function approximation. We assume that the supervised learning model $f_w: \mathcal{X} \rightarrow \mathbb{R}$ is differentiable w.r.t. its $d_w$-dimensional parameter vector $w \in \mathcal{W} \subset \mathbb{R}^{d_w}$ and that the function class $\mathcal{F} = \{f_w \mid w \in \mathcal{W}\}$ is flexible enough that the true objective function $f = f_{w^*}$ for some $w^* \in \mathcal{W}$, i.e., $\mathcal{F}$ is realizable. Our algorithm, Global Optimization via Upper Confidence Bound (GO-UCB), has two phases:

The GO-UCB Framework:
- Phase I: Uniformly explore $n$ data points.
- Phase II: Optimistically explore $T$ data points.

The goal of Phase I is to sufficiently explore the function and ensure the estimated parameter $\hat{w}_0$ is close enough to the true parameter $w^*$ so that exploration in Phase II is efficient. To solve the estimation problem, we rely on a regression oracle that returns an estimate $\hat{w}_0$ after $n$ observations. In detail, after Phase I we have a dataset $\{(x_j, y_j)\}_{j=1}^{n}$, and then

$$\hat{w}_0 \leftarrow \arg\min_{w \in \mathcal{W}} \sum_{j=1}^{n} (f_w(x_j) - y_j)^2. \tag{2}$$

This is a non-linear least squares problem. It is computationally hard in the worst case, but many algorithms (e.g., SGD, Gauss-Newton, Levenberg-Marquardt) are known to solve it effectively in practice. Our theoretical analysis of $\hat{w}_0$ uses techniques from Nowak (2007). See Section 5.1 for details.
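As a concrete illustration, the following PyTorch sketch implements Phase I under the assumption that the regression oracle is realized by SGD on the squared loss of eq. (2), one of the practical solvers named above; `f_noisy` and `model` are user-supplied stand-ins for the oracle of eq. (1) and the parametric family $f_w$.

```python
import torch

def phase_one(f_noisy, model, n, dx, epochs=200, lr=1e-2):
    """Uniform exploration (Phase I) followed by a regression-oracle fit of eq. (2).

    f_noisy: the noisy zeroth-order oracle of eq. (1).
    model:   a differentiable parametric model f_w (torch.nn.Module).
    SGD is one of several reasonable solvers for this non-linear least squares problem.
    """
    # Sample x_j ~ U(X) on [-5, 5]^dx and observe y_j = f(x_j) + eta_j.
    X = 10 * torch.rand(n, dx) - 5
    y = torch.tensor([f_noisy(x) for x in X], dtype=torch.float32)
    # Oracle(x_1, y_1, ..., x_n, y_n): minimize the squared error of eq. (2).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(X).squeeze(-1) - y) ** 2).sum()
        loss.backward()
        opt.step()
    return model  # its parameters now hold the estimate w_hat_0
```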
In Phase II, exploration is conducted following the principle of *optimism in the face of uncertainty*, i.e., the parameter is optimized within an uncertainty region that always contains the true parameter $w^*$. Existing work in bandit algorithms provides techniques that work when $f_w$ is a linear function (Abbasi-yadkori et al., 2011) or a generalized linear function (Li et al., 2017), but no solution for general differentiable functions is known. At the core of our GO-UCB is a carefully designed uncertainty ball $\text{Ball}_t$ over parameters based on gradients, which allows techniques from linear bandits (Abbasi-yadkori et al., 2011) to be adapted to the non-linear case. In detail, the ball is centered at $\hat{w}_t$, the solution to a regularized online regression problem after $t-1$ rounds of observations, and its radius is measured by the covariance matrix of the gradient vectors of all previous rounds. We prove that $w^*$ is always trapped within the ball with high probability.

**Contributions.** In summary, our main contributions are:

1. We initiate the study of the global optimization problem with parametric function approximation and propose a new optimistic exploration algorithm, GO-UCB.
2. Assuming realizability and other mild geometric conditions, we prove that GO-UCB converges to the global optimum with cumulative regret of order $\tilde{O}(\sqrt{T})$, where $T$ is the time horizon.
3. GO-UCB does not suffer from the curse of dimensionality the way Gaussian process-based Bayesian optimization methods do. The unknown objective function $f$ can be high-dimensional, non-convex, non-differentiable, and even discontinuous in its input domain.
4. Synthetic test function and real-world hyperparameter tuning experiments show that GO-UCB works better than all compared Bayesian optimization methods in both realizable and misspecified settings.

**Technical novelties.** The design of the GO-UCB algorithm builds upon the work of Abbasi-yadkori et al. (2011) and Agarwal et al. (2021), but requires substantial technical novelty as we handle generic nonlinear parametric function approximation. Specifically:

1. The LinUCB analysis (e.g., self-normalized martingale concentration, elliptical potential lemmas (Abbasi-yadkori et al., 2011; Agarwal et al., 2021)) is not applicable to nonlinear function approximation, but we show that it can be adapted for this purpose if we can localize the learner to a neighborhood of $w^*$.
2. We identify a new set of structural assumptions under which we can localize the learner sufficiently with only $O(\sqrt{T})$ rounds of pure exploration.
3. Showing that $w^*$ remains inside the parameter uncertainty ball $\text{Ball}_t$ for all $t \in [T]$ is challenging. We solve this problem by centering the regularization at the initialization parameter $\hat{w}_0$ and presenting a novel inductive proof of a lemma showing that, for all $t \in [T]$, $\hat{w}_t$ converges to $w^*$ in $\ell_2$-distance at the same rate.

These new techniques could be of independent interest.

## 2. Related Work

Global non-convex optimization is an important problem found in many research communities and real-world applications, e.g., optimization (Rinnooy Kan & Timmer, 1987a;b), machine learning (Bubeck et al., 2011; Malherbe & Vayatis, 2017), hyperparameter tuning (Hazan et al., 2018), neural architecture search (Kandasamy et al., 2018; Wang et al., 2020), and material discovery (Frazier & Wang, 2016). One of the most prominent approaches to this problem is Bayesian Optimization (BO) (Shahriari et al., 2015), in which the objective function is usually modeled by a Gaussian Process (GP) (Williams & Rasmussen, 2006), so that the uncertainty can be updated under the Bayesian formalism. Among the many notable algorithms in GP-based BO (Srinivas et al., 2010; Jones et al., 1998; Bull, 2011; Frazier et al., 2009; Agrawal & Goyal, 2013; Cai & Scarlett, 2021), GP-UCB (Srinivas et al., 2010) is the closest to our paper because our algorithm also selects data points in a UCB (upper confidence bound) style, but the construction of the UCB in our paper is different since we are not working with GPs. Scarlett et al. (2017) prove lower bounds on regret for noisy Gaussian process bandit optimization. GPs are highly flexible and can approximate any smooth function, but this flexibility comes at the price of the curse of dimensionality: most BO algorithms do not work well when $d > 10$. Notable exceptions include the work of Shekhar & Javidi (2018); Calandriello et al. (2019); Eriksson et al. (2019); Salgia et al. (2021); Rando et al. (2022), who designed more specialized BO algorithms for high-dimensional tasks.
Besides BO with GPs, other nonparametric families have been considered for global optimization tasks, but they, too, suffer from the curse of dimensionality. We refer readers to Wang et al. (2018) and the references therein. While most BO methods use GPs as surrogate models, there are other BO methods that use alternative function classes such as neural networks (Snoek et al., 2015; Springenberg et al., 2016). These methods differ from ours in that they use different ways to fit the neural networks and a Monte Carlo sampling approach to decide where to explore next. Empirically, it was reported that they do not outperform advanced GP-based methods that use trust regions (Eriksson et al., 2019).

Our problem is also connected to the bandits literature (Li et al., 2019; Foster & Rakhlin, 2020; Russo & Van Roy, 2013; Filippi et al., 2010). The global optimization problem can be written as a nonlinear bandit problem in which queried points are actions and function evaluations are rewards. However, no existing bandit algorithm can simultaneously handle an infinite action space and a generic nonlinear reward function; here, generic means the reward function is much more general than a linear or generalized linear function (Filippi et al., 2010). To the best of our knowledge, we are the first to address infinite-armed bandit problems with a general differentiable value function (albeit with some additional assumptions).

A recent line of work studies bandits and global optimization with neural function approximation (Zhou et al., 2020; Zhang et al., 2020; Dai et al., 2022). The main difference from us is that these results still rely on Gaussian processes with a Neural Tangent Kernel in their analysis, and are thus intrinsically linear. Their regret bounds also require the width of the neural network to be much larger than the number of samples in order to be sublinear. In contrast, our results apply to general nonlinear function approximation and do not require overparameterization.

## 3. Preliminaries

### 3.1. Notations

We use $[n]$ to denote the set $\{1, 2, ..., n\}$. The algorithm queries $n$ points in Phase I and $T$ points in Phase II. Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}$ denote the domain and range of $f$, and let $\mathcal{W} \subseteq [0,1]^{d_w}$ denote the parameter space of a family of functions $\mathcal{F} := \{f_w: \mathcal{X} \rightarrow \mathcal{Y} \mid w \in \mathcal{W}\}$. For convenience, we denote the bivariate function $f_w(x)$ by $f_x(w)$ when $w$ is the variable of interest. $\nabla f_x(w)$ and $\nabla^2 f_x(w)$ denote the gradient and Hessian of $f$ w.r.t. $w$. $L(w) := \mathbb{E}_{x \sim U}(f_x(w) - f_x(w^*))^2$ denotes the (expected) risk function, where $U$ is the uniform distribution. For a vector $x$, its $\ell_p$ norm is $\|x\|_p = (\sum_{i=1}^{d}|x_i|^p)^{1/p}$ for $1 \le p < \infty$, and its $\ell_\infty$ norm is $\|x\|_\infty = \max_{i \in [d_x]}|x_i|$. For a matrix $A$, its operator norm is denoted by $\|A\|_{op}$. For a vector $x$ and a square matrix $A$, define $\|x\|_A^2 = x^\top A x$. Throughout this paper, we use standard big-$O$ notation to hide universal constants and, to improve readability, $\tilde{O}$ to hide all logarithmic factors as well as all polynomial factors in problem-specific parameters except $d_w$, $1/\mu$, $T$. For the reader's easy reference, we list all symbols and notations in Appendix A.

### 3.2. Assumptions

Here we list the main assumptions that we work with throughout this paper. The first says that we have access to a differentiable function family that contains the unknown objective function.

**Assumption 3.1** (Realizability). There exists $w^* \in \mathcal{W}$ such that the unknown objective function $f = f_{w^*}$. Also, assume $\mathcal{W} \subseteq [0,1]^{d_w}$. This is w.l.o.g.
for any compact $\mathcal{W}$.

A realizable parameter class is a common assumption in the literature (Chu et al., 2011; Foster et al., 2018; Foster & Rakhlin, 2020), and is usually the starting point of a line of research on a new problem, because one does not need to worry about the extra regret incurred by a misspecified parameter. Although in this paper we only theoretically study the realizable parameter class, our GO-UCB algorithm empirically works well in misspecified tasks too.

*Figure 1. Example of a highly non-convex $L(w)$ satisfying Assumption 3.3. Solid lines denote the actual lower bound obtained by taking the min over the strong convexity and growth conditions. $L(w)$ is strongly convex near $w^*$ but can be highly non-convex away from $w^*$.*

The second assumption concerns the properties of the function approximation.

**Assumption 3.2** (Bounded, differentiable and smooth function approximation). There exist constants $F, C_g, C_h > 0$ such that for all $x \in \mathcal{X}$ and $w \in \mathcal{W}$, it holds that $|f_x(w)| \le F$, $\|\nabla f_x(w)\|_2 \le C_g$, and $\|\nabla^2 f_x(w)\|_{op} \le C_h$.

This assumption imposes mild regularity conditions on the smoothness of the function with respect to its parameter vector $w$. The third assumption is on the expected loss function over the uniform distribution (or any other exploration distribution) used in Phase I of GO-UCB.

**Assumption 3.3** (Geometric conditions on the loss function). $L(w) = \mathbb{E}_{x \sim U}(f_x(w) - f_x(w^*))^2$ satisfies the $(\tau, \gamma)$-growth condition or $\mu$-local strong convexity at $w^*$, i.e., for all $w \in \mathcal{W}$,

$$\min\left\{\frac{\mu}{2}\|w - w^*\|_2^2,\ \frac{\tau}{2}\|w - w^*\|_2^\gamma\right\} \le L(w) - L(w^*),$$

for constants $\mu, \tau > 0$, $\mu < d_w$ and $0 < \gamma < 2$. Also, $L(w)$ satisfies a $c$-local self-concordance assumption at $w^*$, i.e., for all $w$ s.t. $\|w - w^*\|_{\nabla^2 L(w^*)} \le c$,

$$(1-c)^2\,\nabla^2 L(w^*) \preceq \nabla^2 L(w) \preceq (1-c)^{-2}\,\nabla^2 L(w^*).$$

We also assume $c \le 0.5$ for convenience. This is without loss of generality: if the condition holds for some $c > 0.5$, then the condition for $c \le 0.5$ is automatically satisfied.

This assumption has three main components: a (global) growth condition, local strong convexity, and local self-concordance. The global growth condition says that $f_w$ with parameters far away from $w^*$ cannot approximate $f$ well over the distribution $U$. The local strong convexity assumption requires the local neighborhood of $w^*$ to have quadratic growth. These two conditions are strictly weaker than global strong convexity, because they require convexity only in a local neighborhood of the global optimum $w^*$, i.e., $\{w \mid \|w - w^*\|_2 \le (\tau/\mu)^{\frac{1}{2-\gamma}}\}$, and they do not limit the number of spurious local minima, as the global $\gamma$-growth condition only gives a mild lower bound as $w$ moves away from $w^*$. See Figure 1 for an example. Our results work even if $\gamma$ is a small constant $< 1$.

Self-concordance originates from a clean analysis of Newton's method (Nesterov & Nemirovskii, 1994). See Example 4 of Zhang et al. (2017) for a list of examples satisfying self-concordance. A localized version of self-concordance is needed in our problem for technical reasons, but again it is required only within a small ball of radius $c$ around $w^*$ for the expected loss under $U$. Our results work even if $c$ vanishes at rate $O(T^{-1/4})$.

To avoid any confusion: the three assumptions above concern only the expected loss function w.r.t. the uniform distribution $U$ as a function of $w$, not the objective function $f_{w^*}(x)$. The problem being optimized can still be arbitrarily complex in terms of $\mathcal{X}$, e.g., high-dimensional and non-continuous.
As an example, in Gaussian process-based Bayesian optimization approaches, $f_{w^*}(x)$ belongs to a reproducing kernel Hilbert space, but its loss function is globally convex in its infinite-dimensional parameter $w$. Also, we no longer need this assumption in Phase II.

**Additional notations.** For convenience, we define $\zeta > 0$ such that $\|\nabla^2 L(w^*)\|_{op} \le \zeta$. The existence of a finite $\zeta$ is implied by Assumption 3.2, and it suffices to take $\zeta = 2C_g^2$ because $\nabla^2 L(w^*) = \mathbb{E}_{x \sim U}[2\nabla f_x(w^*)\nabla f_x(w^*)^\top]$.

## 4. Main Results

In Section 4.1, we state our Global Optimization with Upper Confidence Bound (GO-UCB) algorithm and explain its key design points. Then, in Section 4.2, we prove that its cumulative regret is at the rate of $\tilde{O}(\sqrt{T})$.

### 4.1. Algorithm

Our GO-UCB algorithm, shown in Algorithm 1, has two phases: Phase I does uniform exploration for $n$ rounds and Phase II does optimistic exploration for $T$ rounds.

Algorithm 1 GO-UCB
Input: time horizon $T$, uniform exploration phase length $n$, uniform distribution $U$, regression oracle Oracle, regularization weight $\lambda$, confidence sequence $\beta_t$ for $t = 1, 2, ..., T$.
Phase I (Uniform exploration):
1: for $j = 1, ..., n$ do
2:   Sample $x_j \sim U(\mathcal{X})$.
3:   Observe $y_j = f(x_j) + \eta_j$.
4: end for
5: Estimate $\hat{w}_0 \leftarrow \text{Oracle}(x_1, y_1, ..., x_n, y_n)$.
Phase II (Optimistic exploration):
1: for $t = 1, ..., T$ do
2:   Update $\Sigma_t$ by eq. (3) with input $\lambda$.
3:   Update $\hat{w}_t$ by eq. (5) with input $\lambda$.
4:   Update $\text{Ball}_t$ by eq. (6) with input $\beta_t$.
5:   Select $x_t = \arg\max_{x \in \mathcal{X}}\max_{w \in \text{Ball}_t} f_x(w)$.
6:   Observe $y_t = f(x_t) + \eta_t$.
7: end for
Output: $\hat{x} \sim U(\{x_1, ..., x_T\})$.

In Step 1 of Phase I, $n$ is chosen to be large enough that the objective function can be sufficiently explored. Steps 2-3 perform uniform sampling. In Step 5, we call the regression oracle to estimate $\hat{w}_0$ given all observations of Phase I, as in eq. (2). Adapting Nowak (2007), we prove that $\|\hat{w}_0 - w^*\|_2$ converges at the rate of $O(1/\sqrt{n})$. See Theorem 5.2 for details.

The key challenge of Phase II of GO-UCB is to design an acquisition function to select $x_t$ for all $t \in [T]$. Since we use a parametric function to approximate the objective function, we rely heavily on a feasible parameter uncertainty region $\text{Ball}_t$, $t \in [T]$, which should always contain the true parameter $w^*$ throughout the process. The shape of $\text{Ball}_t$ is measured by the covariance matrix $\Sigma_t$, defined as

$$\Sigma_t = \lambda I + \sum_{i=0}^{t-1}\nabla f_{x_i}(\hat{w}_i)\nabla f_{x_i}(\hat{w}_i)^\top. \tag{3}$$

Note that the index $i$ runs over both $x$ and $w$, which means that as time $t$ goes from 0 to $T$, each update to $\Sigma_t$ is rank one. This allows us to bound the change of $\Sigma_t$ from $t = 0$ to $T$. $\text{Ball}_t$ is centered at $\hat{w}_t$, the newly estimated parameter at round $t$. In Step 3, we update the estimate $\hat{w}_t$ by solving the following optimization problem:

$$\hat{w}_t = \arg\min_{w}\ \frac{\lambda}{2}\|w - \hat{w}_0\|_2^2 + \frac{1}{2}\sum_{i=0}^{t-1}\big((w - \hat{w}_i)^\top\nabla f_{x_i}(\hat{w}_i) + f_{x_i}(\hat{w}_i) - y_i\big)^2. \tag{4}$$

This is an online regularized least squares problem involving gradients from all previous rounds, i.e., $\nabla f_{x_i}(\hat{w}_i)$, $i \in [T]$. The intuition behind it is that we use gradients to approximate the function, since we are dealing with a generic objective function. We set the regularization w.r.t. $\hat{w}_0$ rather than 0 because the regression oracle tells us how close $\hat{w}_0$ is to $w^*$. Setting the gradient of the objective in eq. (4) to 0, the closed-form solution of $\hat{w}_t$ is

$$\hat{w}_t = \Sigma_t^{-1}\left(\sum_{i=0}^{t-1}\nabla f_{x_i}(\hat{w}_i)\big(\nabla f_{x_i}(\hat{w}_i)^\top\hat{w}_i + y_i - f_{x_i}(\hat{w}_i)\big)\right) + \lambda\Sigma_t^{-1}\hat{w}_0. \tag{5}$$

Now we can state our definition of $\text{Ball}_t$:

$$\text{Ball}_t = \{w : \|w - \hat{w}_t\|_{\Sigma_t}^2 \le \beta_t\}, \tag{6}$$

where $\beta_t$ is a pre-defined monotonically increasing sequence. Following the optimism-in-the-face-of-uncertainty idea, the ball is centered at $\hat{w}_t$, with $\beta_t$ its radius and $\Sigma_t$ measuring its shape; $\beta_t$ ensures that the true parameter $w^*$ is always contained in $\text{Ball}_t$ w.h.p. In Section 5.2 we will show that it suffices to choose

$$\beta_t = \tilde{O}\left(d_w\sigma^2 + \frac{d_w^3}{\mu^2} + \frac{d_w^3 t}{\mu^2 T}\right), \tag{7}$$

where $\tilde{O}$ hides logarithmic terms in $t$, $T$ and $1/\delta$ (w.p. $\ge 1-\delta$). Then, in Step 5 of Phase II, $x_t$ is selected by joint optimization over $x \in \mathcal{X}$ and $w \in \text{Ball}_t$. Finally, we collect all observations over the $T$ rounds and output $\hat{x}$ by sampling uniformly from $\{x_1, ..., x_T\}$.
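The sketch below mirrors Phase II of Algorithm 1 in PyTorch. Two simplifications are ours, not the paper's: the inner maximization over $w \in \text{Ball}_t$ is replaced by its first-order (linearized) upper bound $f_x(\hat{w}_t) + \sqrt{\beta_t}\,\|\nabla f_x(\hat{w}_t)\|_{\Sigma_t^{-1}}$, and the outer argmax over $x$ runs over a finite candidate set. `beta` is the user-chosen schedule from eq. (7), and `w0` is the flattened $\hat{w}_0$ from Phase I.

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def grad_f(model, x):
    """grad_w f_x(w) at the model's current parameters, flattened to R^{d_w}."""
    out = model(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads]).detach()

def phase_two(f_noisy, model, T, lam, beta, candidates, w0):
    """Optimistic exploration (Phase II), a sketch with the simplifications above."""
    dw = w0.numel()
    Sigma = lam * torch.eye(dw)          # eq. (3): running gradient covariance
    rhs = lam * w0.clone()               # accumulates the sum inside eq. (5)
    xs = []
    for t in range(1, T + 1):
        Sigma_inv = torch.linalg.inv(Sigma)
        # Step 5: linearized UCB acquisition over the candidate points.
        ucb = []
        for x in candidates:
            g = grad_f(model, x)
            ucb.append(model(x.unsqueeze(0)).item()
                       + torch.sqrt(beta(t) * g @ Sigma_inv @ g).item())
        x_t = candidates[int(torch.tensor(ucb).argmax())]
        y_t = f_noisy(x_t)               # Step 6: noisy observation
        # Steps 2-3 for the next round: rank-one update of Sigma and eq. (5).
        w_t = parameters_to_vector(model.parameters()).detach()
        g_t = grad_f(model, x_t)
        Sigma += torch.outer(g_t, g_t)
        rhs += g_t * (g_t @ w_t + y_t - model(x_t.unsqueeze(0)).item())
        vector_to_parameters(torch.linalg.inv(Sigma) @ rhs, model.parameters())
        xs.append(x_t)
    return xs                            # output x_hat by uniform sampling from xs
```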
### 4.2. Regret Upper Bound

Now we present the cumulative regret upper bound of the GO-UCB algorithm.

**Theorem 4.1** (Cumulative regret of GO-UCB). Suppose Assumptions 3.1, 3.2, & 3.3 hold with parameters $F, C_g, C_h, \zeta, \mu, \gamma, \tau, c$. Assume

$$T > Cd_w^2F^4\iota^2\max\left\{\frac{\mu^{\gamma/(2-\gamma)}}{\tau^{2/(2-\gamma)}},\ \frac{\zeta}{\mu c^2}\right\}^2, \tag{8}$$

where $C$ is a universal constant and $\iota$ is a logarithmic term depending on $n, C_h, 2/\delta$ (both from Theorem 5.2). Then Algorithm 1 with parameters $n = \sqrt{T}$, $\lambda = C_\lambda\sqrt{T}$ (for a $C_\lambda$ that depends logarithmically on $T$ and polynomially on all other parameters), and $\beta_{1:T}$ as in eq. (7) obeys, with probability at least $1-\delta$,

$$R_T \le \tilde{O}\left(\sqrt{T\beta_T d_w + \frac{T^2\beta_T^2}{\lambda^2}}\right).$$

Let us highlight a few interesting aspects of the result.

**Remark 4.2.** Without a Gaussian process assumption, we propose the first algorithm to solve the global optimization problem with $\tilde{O}(\sqrt{T})$ cumulative regret, which is dimension-free in terms of the input domain $\mathcal{X}$. GO-UCB is a no-regret algorithm since $\lim_{T\rightarrow\infty}R_T/T = 0$, and the output $\hat{x}$ satisfies $f^* - \mathbb{E}[f(\hat{x})] \le \tilde{O}(1/\sqrt{T})$, which is also known as an expected simple regret upper bound. The dependence on $T$ is optimal up to logarithmic factors, as it matches the lower bound for linear bandits (Dani et al., 2008, Theorem 3).

**Remark 4.3** (Choice of $\lambda$). One important deviation from the classical linear bandit analysis is that we require a regularizer centered at $\hat{w}_0$ and a regularization weight $\lambda = C_\lambda\sqrt{T}$, compared to $\lambda = O(1)$ in the linear case. This choice ensures that $\hat{w}_t$ stays within the local neighborhood of $\hat{w}_0$, and delicately balances the different terms that appear in the regret analysis so that the overall regret bound is $\tilde{O}(\sqrt{T})$.

**Remark 4.4** (Choice of $n$). We choose $n = \sqrt{T}$, which imposes the sample-complexity requirement on $T$ shown in eq. (8). The choice of $n$ plays two roles. First, it guarantees that the regression result $\hat{w}_0$ lies, with high probability, in the neighboring region of $w^*$ on the loss surface $L(w)$. This neighboring region has nice properties, e.g., local strong convexity, which allow us to upper-bound the $\ell_2$-distance between $\hat{w}_0$ and $w^*$. Second, in Phase I we do uniform sampling over the function, so the cumulative regret of Phase I is bounded by $2Fn = 2F\sqrt{T}$, which is the same $\tilde{O}(\sqrt{T})$ rate as in Phase II.
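To make Remark 4.2 concrete, here is a worked simplification (ours; logarithmic factors and the constants $F, C_g, C_h, \sigma$ are absorbed into $\tilde{O}$) that plugs the prescribed choices into the theorem's bound:

```latex
% Plug \lambda = C_\lambda\sqrt{T} and \beta_T = \tilde{O}(d_w\sigma^2 + d_w^3/\mu^2)
% (eq. (7) at t = T) into the bound of Theorem 4.1:
R_T \;\le\; \tilde{O}\!\left(\sqrt{T\beta_T d_w + \frac{T^2\beta_T^2}{\lambda^2}}\right)
    \;\le\; \tilde{O}\!\left(\sqrt{T\beta_T d_w} + \frac{\sqrt{T}\,\beta_T}{C_\lambda}\right)
    \;=\; \tilde{O}\!\left(\left(\frac{d_w^2}{\mu}
          + \frac{d_w^3}{\mu^2 C_\lambda}\right)\sqrt{T}\right).
```

Both terms grow as $\sqrt{T}$ with only polynomial dependence on $d_w$ and $1/\mu$ hidden, which is the no-regret property asserted in Remark 4.2.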
## 5. Proof Overview

In this section, we give a proof sketch of all theoretical results. A key insight of our analysis is that there is more mileage in the seminal techniques developed by Abbasi-yadkori et al. (2011) for analyzing linearly parameterized bandit problems than previously known: they can be used to analyze non-linear bandits as well, provided we localize to a nearly optimal region and carefully handle the non-linear components via more aggressive regularization. Other assumptions that give rise to a similarly good initialization may work too, and our new proof can be of independent interest for analyzing other extensions of LinUCB, e.g., to contextual bandits, reinforcement learning, and other problems.

In detail, we first prove the estimation error bound of $\hat{w}_0$ for Phase I of GO-UCB, then prove the feasibility of $\text{Ball}_t$. Finally, putting everything together, we prove the cumulative regret bound of GO-UCB. Due to the page limit, we list all auxiliary lemmas in Appendix B and show complete proofs in Appendix C.

### 5.1. Regression Oracle Guarantee

The goal of Phase I of GO-UCB is to sufficiently explore the unknown objective function with $n$ uniform queries and obtain an estimated parameter $\hat{w}_0$. Assuming access to a regression oracle, we prove a convergence bound for $\|\hat{w}_0 - w^*\|_2^2$. To get started, we need the following regression oracle lemma.

**Lemma 5.1** (Adapted from Nowak (2007)). Suppose Assumptions 3.1 & 3.2 hold. There is an absolute constant $C'$ such that after round $n$ in Phase I of Algorithm 1, with probability $> 1-\delta/2$, the regression-oracle estimate $\hat{w}_0$ satisfies

$$\mathbb{E}_{x\sim U}[(f_x(\hat{w}_0) - f_x(w^*))^2] \le \frac{C'd_wF^2\iota}{n},$$

where $\iota$ is a logarithmic term depending on $n, C_h, 2/\delta$.

Nowak (2007) proves that the expected squared error of the empirical risk minimization (ERM) estimator can be bounded at the rate of $O(1/n)$ with high probability, rather than the $O(1/\sqrt{n})$ rate achieved by Chernoff/Hoeffding bounds; the result holds in both realizable and misspecified settings. The proof of Lemma 5.1 consists of specializing it to our regression oracle via Assumption 3.1 and an $\varepsilon$-covering-number argument over the parameter class. Basically, Lemma 5.1 says that the expected squared error of $f_x(\hat{w}_0)$ relative to $f_x(w^*)$ converges at the rate of $O(1/n)$ with high probability. Based on it, we prove the following regression oracle guarantee.

**Theorem 5.2** (Regression oracle guarantee). Suppose Assumptions 3.1, 3.2, & 3.3 hold. There is an absolute constant $C$ such that after round $n$ in Phase I of Algorithm 1, where $n$ satisfies

$$n \ge Cd_wF^2\iota\,\max\left\{\frac{\mu^{\gamma/(2-\gamma)}}{\tau^{2/(2-\gamma)}},\ \frac{\zeta}{\mu c^2}\right\},$$

with probability $> 1-\delta/2$ the regression-oracle estimate $\hat{w}_0$ satisfies

$$\|\hat{w}_0 - w^*\|_2^2 \le \frac{Cd_wF^2\iota}{\mu n},$$

where $\iota$ is a logarithmic term depending on $n, C_h, 2/\delta$.

Compared with Lemma 5.1, there is an extra sample-complexity requirement on $n$: we need $n$ to be sufficiently large so that the function is sufficiently explored and, more importantly, $\hat{w}_0$ falls into the neighboring (strongly convex) region of $w^*$. See Figure 1 for an illustration. This is also why the strong convexity parameter $\mu$ appears in the denominator of the upper bound.

### 5.2. Feasibility of Ballt

The following lemma is the key to the algorithm design of GO-UCB. It says that our definition of $\text{Ball}_t$ is appropriate, i.e., throughout all rounds of Phase II, $w^*$ is contained in $\text{Ball}_t$ with high probability.

**Lemma 5.3** (Feasibility of $\text{Ball}_t$). Set $\Sigma_t, \hat{w}_t$ as in eq. (3), (5). Set $\beta_t$ as

$$\beta_t = \tilde{O}\left(d_w\sigma^2 + \frac{d_w^3}{\mu^2} + \frac{d_w^3 t}{\mu^2 T}\right). \tag{9}$$

Suppose Assumptions 3.1, 3.2, & 3.3 hold and choose $n = \sqrt{T}$. Then for all $t \in [T]$ in Phase II of Algorithm 1, w.p. $> 1-\delta$, $\|\hat{w}_t - w^*\|_{\Sigma_t}^2 \le \beta_t$.

For the reader's easy reference, eq. (9) restates our choice of $\beta_t$. Note that this lemma requires careful choices of $\lambda$ and $n$, because $\beta_t$ appears later in the cumulative regret bound and is required to be $\tilde{O}(1)$ in $T$. The proof has three steps. First, we obtain the closed-form solution of $\hat{w}_t$ as in eq. (5). Next, we use induction to prove that for all $t \in [T]$, $\|\hat{w}_t - w^*\|_2^2 \le O(\tilde{C}/n)$ for a universal constant $\tilde{C}$. Finally, we prove $\|\hat{w}_t - w^*\|_{\Sigma_t}^2 \le \beta_t$.

### 5.3. Regret Analysis

To prove the cumulative regret bound of GO-UCB, we need the following two lemmas bounding the instantaneous regret in Phase II of GO-UCB.

**Lemma 5.4** (Instantaneous regret bound).
Set $\Sigma_t, \hat{w}_t, \beta_t$ as in eq. (3), (5), & (7), and suppose Assumptions 3.1, 3.2, & 3.3 hold. Then with probability $> 1-\delta$, $w^*$ is contained in $\text{Ball}_t$, and defining $u_t = \|\nabla f_{x_t}(\hat{w}_t)\|_{\Sigma_t^{-1}}$, for all $t \in [T]$ in Phase II of Algorithm 1,

$$r_t \le 2\sqrt{\beta_t}\,u_t + \frac{2\beta_t C_h}{\lambda}.$$

The first term of the upper bound is fairly standard, appearing also in LinUCB (Abbasi-yadkori et al., 2011) and GP-UCB (Srinivas et al., 2010). The second term upper-bounds the higher-order residual left after the first-order gradient approximation of the objective function, which introduces an extra challenge in deriving the bound. Technically, the proof of Lemma 5.4 requires that $w^*$ be contained in our parameter uncertainty ball $\text{Ball}_t$ with high probability throughout Phase II of GO-UCB, which was proven in Lemma 5.3. The proof then applies Taylor's theorem and uses the convexity of $\text{Ball}_t$ twice. See Appendix C.4.

The next lemma is an extension of Lemma 5.4; its proof uses the fact that $\beta_t$ is monotonically increasing in $t$.

**Lemma 5.5** (Summation of squared instantaneous regret bound). Set $\Sigma_t, \hat{w}_t, \beta_t$ as in eq. (3), (5), & (7), and suppose Assumptions 3.1, 3.2, & 3.3 hold. Then with probability $> 1-\delta$, $w^*$ is contained in $\text{Ball}_t$, and for all $t \in [T]$ in Phase II of Algorithm 1,

$$\sum_{t=1}^{T}r_t^2 \le 16\beta_T d_w\log\left(1 + \frac{TC_g^2}{d_w\lambda}\right) + \frac{8\beta_T^2C_h^2T}{\lambda^2}.$$

The proof of Theorem 4.1 follows by putting everything together via the Cauchy-Schwarz inequality $\sum_{t=1}^{T}r_t \le \sqrt{T\sum_{t=1}^{T}r_t^2}$.

## 6. Experiments

We compare our GO-UCB algorithm with four Bayesian Optimization (BO) algorithms: GP-EI (Jones et al., 1998), GP-PI (Kushner, 1964), GP-UCB (Srinivas et al., 2010), and Trust Region BO (TuRBO) (Eriksson et al., 2019), where the first three are classical methods and TuRBO is a more advanced algorithm designed for high-dimensional problems. To run GO-UCB, we choose the parametric function model $\hat{f}$ to be a two-linear-layer neural network with the sigmoid activation function:

$$\hat{f}(x) = \text{linear}_2(\text{sigmoid}(\text{linear}_1(x))),$$

where $w_1, b_1$ denote the weight and bias of the linear-1 layer and $w_2, b_2$ those of the linear-2 layer. Specifically, we set $w_1 \in \mathbb{R}^{25\times d_x}$, $b_1 \in \mathbb{R}^{25}$, $w_2 \in \mathbb{R}^{25}$, $b_2 \in \mathbb{R}$, i.e., the hidden layer has width 25. All implementations are based on the BoTorch framework (Balandat et al., 2020) and the scikit-optimize package (Head et al., 2021) with default parameter settings. To help readers reproduce our results, implementation details are given in Appendix D.1.

### 6.1. Synthetic Experiments

First, we test all algorithms on three high-dimensional synthetic functions defined on $[-5, 5]^{d_x}$ with $d_x = 20$, covering both realizable and misspecified cases. The first test function $f_1$ is created by setting all entries of $w_1, b_1, w_2, b_2$ in $\hat{f}$ to 1, so $f_1$ is a realizable function given $\hat{f}$. The second and third test functions $f_2, f_3$ are the Styblinski-Tang function and the Rastrigin function (both written as maximization targets), defined as

$$f_2(x) = -\frac{1}{2}\sum_{i=1}^{20}\left(x_i^4 - 16x_i^2 + 5x_i\right), \qquad f_3(x) = \sum_{i=1}^{20}\left(10\cos(2\pi x_i) - x_i^2\right),$$

where $x_i$ denotes the $i$-th of the 20 coordinates, so $f_2, f_3$ are misspecified functions given $\hat{f}$. We set $n = 5, T = 25$ for $f_1$ and $n = 8, T = 64$ for $f_2, f_3$. To reduce the effect of randomness in all algorithms, we repeat the whole optimization process 5 times for all algorithms and report the mean and error bars of cumulative regret. The error bar is measured by Wald's test with 95% confidence, i.e., $1.96\nu/\sqrt{5}$, where $\nu$ is the standard deviation of cumulative regret and 5 is the number of repetitions.
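For reference, a PyTorch sketch of this experimental setup is given below. The model follows the two-linear-layer architecture above; the test functions transcribe our reading of the (garbled) formulas for $f_2, f_3$, with the signs reflecting that both are used as maximization targets.

```python
import math
import torch
import torch.nn as nn

class FHat(nn.Module):
    """Two-linear-layer parametric model from Section 6:
    f_hat(x) = linear2(sigmoid(linear1(x))), hidden width 25."""
    def __init__(self, dx, width=25):
        super().__init__()
        self.linear1 = nn.Linear(dx, width)
        self.linear2 = nn.Linear(width, 1)

    def forward(self, x):
        return self.linear2(torch.sigmoid(self.linear1(x)))

def f2(x):  # Styblinski-Tang, negated so the global optimum is a maximum
    return -0.5 * torch.sum(x ** 4 - 16 * x ** 2 + 5 * x, dim=-1)

def f3(x):  # Rastrigin-style test function, written as a maximization target
    return torch.sum(10 * torch.cos(2 * math.pi * x) - x ** 2, dim=-1)

# f1 is realizable by construction: an FHat with all weights and biases set to 1.
f1_net = FHat(dx=20)
with torch.no_grad():
    for p in f1_net.parameters():
        p.fill_(1.0)
f1 = lambda x: f1_net(x).squeeze(-1)
```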
From Figure 2, we see that in all tasks our GO-UCB algorithm performs better than all four compared BO approaches. Among the BO approaches, TuRBO performs the best, since it is specifically designed for high-dimensional tasks.

*Figure 2. Cumulative regret (lower is better) of all algorithms on the 20-dimensional synthetic functions. Panels: (a) $f_1$ (realizable), (b) $f_2$ (misspecified), (c) $f_3$ (misspecified); each plots cumulative regret against the time horizon for GP-UCB, GP-EI, GP-PI, TuRBO, and GO-UCB.*

In Figure 2(a), the mean cumulative regret of GO-UCB and TuRBO stays flat for $t \ge 22$, which means both have found the global optimum, but GO-UCB finds the optimal point shortly after Phase I and enjoys the smallest error bar. This is well expected since $f_1$ is a realizable function for $\hat{f}$. Unfortunately, GP-UCB, GP-EI, and GP-PI incur almost linear regret, showing the poor performance of classical BO algorithms in high-dimensional cases. In Figures 2(b) and 2(c), all methods suffer linear regret because $f_2, f_3$ are misspecified. The gap between GO-UCB and the other methods is smaller in Figure 2(c) than in 2(b) because optimizing $f_3$ is more challenging than $f_2$: $f_3$ has more local optima.

### 6.2. Real-World Experiments

To illustrate that GO-UCB works in real-world tasks, we run hyperparameter tuning experiments with three classifiers on three datasets. The three UCI datasets (Dua & Graff, 2017) are Breast-cancer, Australian, and Diabetes, and the three classifiers are random forest, multi-layer perceptron, and gradient boosting, with 7, 8, and 11 hyperparameters respectively. For each classifier on each dataset, the function mapping hyperparameters to classification accuracy is the black-box function we maximize, so the input dimension is $d_x = 7, 8, 11$ for the respective classifiers. We use cumulative regret to evaluate hyperparameter tuning performance; however, the best accuracy $f^*$ is unknown ahead of time, so we set it to the best empirical accuracy found on each task. To reduce the effect of randomness, we divide each dataset into 5 folds and each time use 4 folds for training and the remaining fold for testing. We report the mean and error bars of cumulative regret, with the error bar again measured by Wald's test, as in the synthetic experiments.
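A minimal sketch of one such black-box objective is shown below, using scikit-learn's random forest on the Breast-cancer dataset. The specific hyperparameters and ranges here are illustrative assumptions, not the paper's exact 7-dimensional search space.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def accuracy_objective(h):
    """Black-box f: hyperparameter vector -> 5-fold CV accuracy.
    The hyperparameters chosen below are hypothetical examples."""
    clf = RandomForestClassifier(
        n_estimators=int(h[0]),                        # e.g., in [10, 200]
        max_depth=int(h[1]),                           # e.g., in [1, 20]
        min_samples_split=int(h[2]),                   # e.g., in [2, 10]
        max_features=float(np.clip(h[3], 0.1, 1.0)),   # fraction of features
        random_state=0,
    )
    return cross_val_score(clf, X, y, cv=5).mean()

print(accuracy_objective([100, 10, 2, 0.5]))
```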
*Figure 3. Cumulative regret (lower is better) of all algorithms on the real-world hyperparameter tuning task on the Breast-cancer dataset. Panels: (a) random forest ($d_x = 7$), (b) multi-layer perceptron ($d_x = 8$), (c) gradient boosting ($d_x = 11$).*

Figure 3 shows results on the Breast-cancer dataset. In Figures 3(b) and 3(c), GO-UCB performs statistically significantly better than all other BO algorithms, as the error bars of TuRBO and GO-UCB barely overlap. This shows that GO-UCB can be deployed in real-world applications as a replacement for BO methods. Also, in Figure 3(b), GO-UCB's Phase I performance is not good, yet GO-UCB still outperforms the others in Phase II, which demonstrates the effectiveness of Phase II of GO-UCB. In Figure 3(a), all algorithms perform similarly. In Figure 3(b), TuRBO performs similarly to GP-UCB, GP-EI, and GP-PI when $t \le 23$, but after $t = 23$ it performs better and shows a curved regret line as it finds optimal points. Due to the page limit, results on the Australian and Diabetes datasets are shown in Appendix D.2, where similar algorithm behavior can be seen.

Note that in the experiments we choose the parametric model $\hat{f}$ to be a two-linear-layer neural network. In other real-world settings, one can choose the model $\hat{f}$ in GO-UCB to be a simpler function or a much more complex one, e.g., a deep neural network, depending on task requirements.

## 7. Conclusion

Global non-convex optimization is an important problem that arises widely in real-world applications, e.g., hyper-parameter tuning for deep learning and new material design. However, solving this optimization problem in general is NP-hard. Existing work relies on the Gaussian process assumption, e.g., Bayesian optimization, or other nonparametric families, which suffer from the curse of dimensionality. We propose the first algorithm to solve such global optimization with parametric function approximation, which offers a new approach to global optimization. GO-UCB first explores the function uniformly to collect a set of observations, and then uses optimistic exploration to actively select points. At the core of GO-UCB is a carefully designed uncertainty set over parameters based on gradients that allows optimistic exploration. Under the realizable parameter class assumption and a few mild geometric conditions, our theoretical analysis shows that the cumulative regret of GO-UCB is at the rate of $\tilde{O}(\sqrt{T})$, which is dimension-free in terms of the function domain $\mathcal{X}$. Our high-dimensional synthetic tests show that GO-UCB works better than BO methods even in the misspecified setting. Moreover, GO-UCB outperforms BO algorithms in real-world hyperparameter tuning tasks, which may be of independent interest.

The strong convexity parameter $\mu$ appears in the denominator of the upper bound in Theorem 4.1; $\mu$ can be small in practice, making the upper bound large. Developing a cumulative regret bound whose $\mu$-dependent term is independent of $T$ remains an open problem.

## Acknowledgments

The work is partially supported by NSF Awards #1934641 and #2134214. We thank Ming Yin and Dan Qiao for helpful discussion and careful proofreading of an early version of the manuscript, as well as Andrew G. Wilson for pointing out that Bayesian optimization methods do not necessarily use Gaussian processes as surrogate models. Finally, we thank the ICML reviewers and the area chair for their valuable input that led to improvements to the paper.

## References

Abbasi-yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24 (NeurIPS'11), 2011.

Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. Reinforcement learning: Theory and algorithms, 2021.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (ICML'13), 2013.

Balandat, M., Karrer, B., Jiang, D., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In Advances in Neural Information Processing Systems 33 (NeurIPS'20), 2020.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12(46):1655-1695, 2011.
Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011.

Cai, X. and Scarlett, J. On lower bounds for standard and robust Gaussian process bandit optimization. In International Conference on Machine Learning (ICML'21), 2021.

Calandriello, D., Carratino, L., Lazaric, A., Valko, M., and Rosasco, L. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory (COLT'19), 2019.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS'11), 2011.

Dai, Z., Shu, Y., Low, B. K. H., and Jaillet, P. Sample-then-optimize batch neural Thompson sampling. In Advances in Neural Information Processing Systems 35 (NeurIPS'22), 2022.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory (COLT'08), 2008.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems 32 (NeurIPS'19), 2019.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NeurIPS'10), 2010.

Foster, D. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning (ICML'20), 2020.

Foster, D., Agarwal, A., Dudik, M., Luo, H., and Schapire, R. Practical contextual bandits with regression oracles. In International Conference on Machine Learning (ICML'18), 2018.

Frazier, P., Powell, W., and Dayanik, S. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599-613, 2009.

Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Frazier, P. I. and Wang, J. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pp. 45-75. Springer, 2016.

Hazan, E., Klivans, A., and Yuan, Y. Hyperparameter optimization: a spectral approach. In International Conference on Learning Representations (ICLR'18), 2018.

Head, T., Kumar, M., Nahrstaedt, H., Louppe, G., and Shcherbatyi, I. scikit-optimize. https://scikit-optimize.github.io, 2021.

Jain, P., Kar, P., et al. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142-363, 2017.

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., and Xing, E. P. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems 31 (NeurIPS'18), 2018.

Kandasamy, K., Vysyaraju, K. R., Neiswanger, W., Paria, B., Collins, C. R., Schneider, J., Poczos, B., and Xing, E. P. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. Journal of Machine Learning Research, 21(81):1-27, 2020.

Kushner, H. J. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 8(1):97-106, 1964.
Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning (ICML'17), 2017.

Li, Y., Wang, Y., and Zhou, Y. Nearly minimax-optimal regret for linearly parameterized bandits. In Annual Conference on Learning Theory (COLT'19), 2019.

Malherbe, C. and Vayatis, N. Global optimization of Lipschitz functions. In International Conference on Machine Learning (ICML'17), 2017.

Nakamura, N., Seepaul, J., Kadane, J. B., and Reeja-Jayan, B. Design for low-temperature microwave-assisted crystallization of ceramic thin films. Applied Stochastic Models in Business and Industry, 33(3):314-321, 2017.

Nesterov, Y. and Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.

Nowak, R. D. Lecture notes: Complexity regularization for squared error loss, 2007. URL https://nowak.ece.wisc.edu/SLT07/lecture12.pdf.

Rando, M., Carratino, L., Villa, S., and Rosasco, L. Ada-BKB: Scalable Gaussian process optimization on continuous domains by adaptive discretization. In International Conference on Artificial Intelligence and Statistics (AISTATS'22), 2022.

Rinnooy Kan, A. and Timmer, G. T. Stochastic global optimization methods part I: Clustering methods. Mathematical Programming, 39(1):27-56, 1987a.

Rinnooy Kan, A. and Timmer, G. T. Stochastic global optimization methods part II: Multi level methods. Mathematical Programming, 39(1):57-78, 1987b.

Russo, D. and Van Roy, B. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems 26 (NeurIPS'13), 2013.

Salgia, S., Vakili, S., and Zhao, Q. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. In Advances in Neural Information Processing Systems 34 (NeurIPS'21), 2021.

Scarlett, J., Bogunovic, I., and Cevher, V. Lower bounds on regret for noisy Gaussian process bandit optimization. In Annual Conference on Learning Theory (COLT'17), 2017.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2015.

Shekhar, S. and Javidi, T. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829-3874, 2018.

Sherman, J. and Morrison, W. J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics, 21(1):124-127, 1950.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning (ICML'15), 2015.

Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems 29 (NeurIPS'16), 2016.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning (ICML'10), 2010.

Wang, L., Fonseca, R., and Tian, Y. Learning search space partition for black-box optimization using Monte Carlo tree search. In Advances in Neural Information Processing Systems 33 (NeurIPS'20), 2020.

Wang, Y., Balakrishnan, S., and Singh, A. Optimization of smooth functions with noisy observations: Local minimax rates. In Advances in Neural Information Processing Systems 31 (NeurIPS'18), 2018.
Williams, C. K. and Rasmussen, C. E. Gaussian Processes for Machine Learning. MIT Press, 2006.

Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30 (NeurIPS'17), 2017.

Zhang, W., Zhou, D., Li, L., and Gu, Q. Neural Thompson sampling. In International Conference on Learning Representations (ICLR'20), 2020.

Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning (ICML'20), 2020.

## A. Notation Table

Table 1. Symbols and notations.

| Symbol | Definition | Description |
| --- | --- | --- |
| $\Vert A\Vert_{op}$ | | operator norm |
| $\text{Ball}_t$ | eq. (6) | parameter uncertainty region at round $t$ |
| $\beta_t$ | eq. (7) | parameter uncertainty region radius at round $t$ |
| $\mu$ | | local strong convexity parameter |
| $c$ | | local self-concordance parameter |
| $C, \zeta$ | | constants |
| $d_x$ | | domain dimension |
| $d_w$ | | parameter dimension |
| $\delta$ | | failure probability |
| $\varepsilon$ | | covering-number discretization distance |
| $\eta$ | | $\sigma$-sub-Gaussian observation noise |
| $f_w(x)$ | | objective function at $x$ parameterized by $w$ |
| $f_x(w)$ | | objective function at $w$ parameterized by $x$ |
| $\nabla f_x(w)$ | | 1st-order derivative w.r.t. $w$, parameterized by $x$ |
| $\nabla^2 f_x(w)$ | | 2nd-order derivative w.r.t. $w$, parameterized by $x$ |
| $F$ | | function range bound |
| $\gamma, \tau$ | | growth condition parameters |
| $\iota, \iota', \iota''$ | | logarithmic terms |
| $L(w)$ | $\mathbb{E}[(f_x(w) - f_x(w^*))^2]$ | expected loss function |
| $\lambda$ | | regularization parameter |
| $n$ | | time horizon in Phase I |
| $[n]$ | $\{1, 2, ..., n\}$ | integer set of size $n$ |
| Oracle | | regression oracle |
| $r_t$ | $f_{w^*}(x^*) - f_{w^*}(x_t)$ | instantaneous regret at round $t$ |
| $R_T$ | $\sum_{t=1}^{T}r_t$ | cumulative regret after round $T$ |
| $\Sigma_t$ | eq. (3) | covariance matrix at round $t$ |
| $T$ | | time horizon in Phase II |
| $U$ | | uniform distribution |
| $w$ | $w \in \mathcal{W}$ | function parameter |
| $w^*$ | $w^* \in \mathcal{W}$ | true parameter |
| $\hat{w}_0$ | | oracle-estimated parameter after Phase I |
| $\hat{w}_t$ | eq. (5) | updated parameter at round $t$ |
| $\mathcal{W}$ | $\mathcal{W} \subseteq [0,1]^{d_w}$ | parameter space |
| $x$ | $x \in \mathcal{X}$ | data point |
| $x^*$ | | optimal data point |
| $\Vert x\Vert_p$ | $(\sum_{i=1}^{d}\vert x_i\vert^p)^{1/p}$ | $\ell_p$ norm |
| $\Vert x\Vert_\infty$ | $\max_{i\in[d]}\vert x_i\vert$ | $\ell_\infty$ norm |
| $\Vert x\Vert_A^2$ | $x^\top Ax$ | distance defined by square matrix $A$ |
| $\mathcal{X}$ | $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ | function domain |
| $\mathcal{Y}$ | $\mathcal{Y} = [-F, F]$ | function range |

## B. Auxiliary Technical Lemmas

In this section, we list the auxiliary lemmas used in the proofs.

**Lemma B.1** (Adapted from eq. (5) & (6) of Nowak (2007)). Given a dataset $\{(x_j, y_j)\}_{j=1}^{n}$ where $y_j$ is generated from eq. (1) and $f_0$ is the underlying true function, let $\hat{f}$ be an ERM estimator taking values in $\mathcal{F}$, where $\mathcal{F}$ is a finite set with $\mathcal{F} \subseteq \{f: [0,1]^d \rightarrow [-F, F]\}$ for some $F \ge 1$. Then with probability $> 1-\delta$, $\hat{f}$ satisfies, for all $\alpha \in (0, 1]$,

$$\mathbb{E}[(\hat{f} - f_0)^2] \le (1+\alpha)\inf_{f\in\mathcal{F}}\mathbb{E}[(f - f_0)^2] + \frac{F^2\big(\log(\vert\mathcal{F}\vert)\log(2) + 2\log(2/\delta)\big)}{n\alpha}.$$

**Lemma B.2** (Sherman-Morrison lemma (Sherman & Morrison, 1950)). Let $A$ denote an invertible matrix and $b, c$ two vectors. Then

$$(A + bc^\top)^{-1} = A^{-1} - \frac{A^{-1}bc^\top A^{-1}}{1 + c^\top A^{-1}b}.$$
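A quick numerical check of Lemma B.2 (ours, in NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))   # a well-conditioned matrix
b, c = rng.standard_normal(d), rng.standard_normal(d)

lhs = np.linalg.inv(A + np.outer(b, c))             # direct inverse of rank-one update
Ainv = np.linalg.inv(A)
rhs = Ainv - (Ainv @ np.outer(b, c) @ Ainv) / (1.0 + c @ Ainv @ b)
print(np.allclose(lhs, rhs))  # True: Sherman-Morrison formula holds
```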
**Lemma B.3** (Self-normalized bound for vector-valued martingales (Abbasi-yadkori et al., 2011; Agarwal et al., 2021)). Let $\{\eta_i\}_{i=1}^{\infty}$ be a real-valued stochastic process with corresponding filtration $\{\mathcal{F}_i\}_{i=1}^{\infty}$ such that $\eta_i$ is $\mathcal{F}_i$-measurable, $\mathbb{E}[\eta_i \mid \mathcal{F}_{i-1}] = 0$, and $\eta_i$ is conditionally $\sigma$-sub-Gaussian with $\sigma \in \mathbb{R}^{+}$. Let $\{X_i\}_{i=1}^{\infty}$ be a stochastic process with $X_i \in \mathcal{H}$ (some Hilbert space) and $X_i$ being $\mathcal{F}_t$-measurable. Assume that a linear operator $\Sigma: \mathcal{H} \rightarrow \mathcal{H}$ is positive definite, i.e., $x^\top\Sigma x > 0$ for any $x \in \mathcal{H}$. For any $t$, define the linear operator $\Sigma_t = \Sigma_0 + \sum_{i=1}^{t}X_iX_i^\top$ (here $xx^\top$ denotes the outer product in $\mathcal{H}$). With probability at least $1-\delta$, we have for all $t \ge 1$:

$$\left\|\sum_{i=1}^{t}X_i\eta_i\right\|_{\Sigma_t^{-1}}^2 \le \sigma^2\log\left(\frac{\det(\Sigma_t)\det(\Sigma_0)^{-1}}{\delta^2}\right).$$

## C. Missing Proofs

In this section, we give complete proofs of all technical results in the main paper. For the reader's easy reference, we define $\iota$ as a logarithmic term depending on $n, C_h, 2/\delta$ (w.p. $> 1-\delta/2$), $\iota'$ as a logarithmic term depending on $t, d_w, C_g, 1/\lambda, 2/\delta$ (w.p. $> 1-\delta/2$), and $\iota''$ as a logarithmic term depending on $t, d_w, C_g, 1/\lambda$.

### C.1. Regression Oracle Guarantee

**Lemma C.1** (Restatement of Lemma 5.1). Suppose Assumptions 3.1 & 3.2 hold. There is an absolute constant $C'$ such that after round $n$ in Phase I of Algorithm 1, with probability $> 1-\delta/2$, the regression-oracle estimate $\hat{w}_0$ satisfies

$$\mathbb{E}_{x\sim U}[(f_x(\hat{w}_0) - f_x(w^*))^2] \le \frac{C'd_wF^2\iota}{n},$$

where $\iota$ is the logarithmic term depending on $n, C_h, 2/\delta$.

*Proof.* The regression oracle lemma builds on Lemma B.1, which works only for a finite function class. To work with our continuous parameter class $\mathcal{W}$, we need an $\varepsilon$-covering-number argument. First, let $\tilde{w}$ and $\widetilde{\mathcal{W}}$ denote the ERM parameter and the finite parameter class after applying the covering-number argument to $\mathcal{W}$. By Lemma B.1, with probability $> 1-\delta/2$,

$$\mathbb{E}_{x\sim U}[(f_x(\tilde{w}) - f_x(w^*))^2] \le (1+\alpha)\inf_{w\in\widetilde{\mathcal{W}}\cup\{w^*\}}\mathbb{E}_{x\sim U}[(f_x(w) - f_x(w^*))^2] + \frac{F^2\big(\log(\vert\widetilde{\mathcal{W}}\vert)\log(2) + 2\log(4/\delta)\big)}{n\alpha} = \frac{F^2\big(\log(\vert\widetilde{\mathcal{W}}\vert)\log(2) + 2\log(4/\delta)\big)}{n\alpha},$$

where the equality is by the realizability assumption (Assumption 3.1), which makes the infimum 0. Our parameter class satisfies $\mathcal{W} \subseteq [0,1]^{d_w}$, so $\log(\vert\widetilde{\mathcal{W}}\vert) = \log(1/\varepsilon^{d_w}) = d_w\log(1/\varepsilon)$, and the new upper bound is that with probability $> 1-\delta/2$,

$$\mathbb{E}_{x\sim U}[(f_x(\tilde{w}) - f_x(w^*))^2] \le C'\left(\frac{d_wF^2\log(1/\varepsilon)}{n} + \frac{\log(2/\delta)}{n}\right),$$

where $C'$ is a universal constant obtained by choosing $\alpha = 1/2$. Note that $\tilde{w}$ is the ERM parameter in $\widetilde{\mathcal{W}}$ after discretization, not our target parameter $\hat{w}_0 \in \mathcal{W}$. By $(a+b)^2 \le 2a^2 + 2b^2$,

$$\mathbb{E}_{x\sim U}[(f_x(\hat{w}_0) - f_x(w^*))^2] \le 2\,\mathbb{E}_{x\sim U}[(f_x(\hat{w}_0) - f_x(\tilde{w}))^2] + 2\,\mathbb{E}_{x\sim U}[(f_x(\tilde{w}) - f_x(w^*))^2] \le 2\varepsilon^2C_h^2 + 2C'\left(\frac{d_wF^2\log(1/\varepsilon)}{n} + \frac{\log(2/\delta)}{n}\right),$$

where the second inequality applies the discretization error $\varepsilon$ and Assumption 3.2. Choosing $\varepsilon = 1/\sqrt{nC_h^2}$, we get

$$\mathbb{E}_{x\sim U}[(f_x(\hat{w}_0) - f_x(w^*))^2] \le \frac{2}{n} + \frac{C'd_wF^2\log(nC_h^2)}{n} + \frac{2C'\log(2/\delta)}{n} \le \frac{C''d_wF^2\big(\log(nC_h^2) + \log(2/\delta)\big)}{n},$$

where we can take $C'' = 2C'$ (assuming $2 < C'd_wF^2\log(nC_h^2)$). The proof completes by defining $\iota$ as the logarithmic term depending on $n, C_h, 2/\delta$.

**Theorem C.2** (Restatement of Theorem 5.2). Suppose Assumptions 3.1, 3.2, & 3.3 hold. There is an absolute constant $C$ such that after round $n$ in Phase I of Algorithm 1, where $n$ satisfies $n \ge Cd_wF^2\iota\max\big\{\mu^{\gamma/(2-\gamma)}\tau^{-2/(2-\gamma)},\ \zeta\mu^{-1}c^{-2}\big\}$, with probability $> 1-\delta/2$, the regression-oracle estimate $\hat{w}_0$ satisfies

$$\|\hat{w}_0 - w^*\|_2^2 \le \frac{Cd_wF^2\iota}{\mu n},$$

where $\iota$ is the logarithmic term depending on $n, C_h, 2/\delta$.

*Proof.* Recall the definition of the expected loss $L(w) = \mathbb{E}_{x\sim U}(f_x(w) - f_x(w^*))^2$. By the second-order Taylor theorem, $L(\hat{w}_0)$ at $w^*$ can be written as

$$L(\hat{w}_0) = L(w^*) + (\hat{w}_0 - w^*)^\top\nabla L(w^*) + \frac{1}{2}\|\hat{w}_0 - w^*\|_{\nabla^2L(\bar{w})}^2,$$

where $\bar{w}$ lies between $\hat{w}_0$ and $w^*$. Also, because $L(w^*) = \mathbb{E}_{x\sim U}(f_x(w^*) - f_x(w^*))^2 = 0$ is the global minimum, $\nabla L(w^*) = 0$. Then, with probability $> 1-\delta/2$,

$$\frac{1}{2}\|\hat{w}_0 - w^*\|_{\nabla^2L(\bar{w})}^2 = L(\hat{w}_0) - L(w^*) \le \frac{C'd_wF^2\iota}{n}, \tag{11}$$

where the inequality is due to Lemma 5.1. Next, we prove the following lemma, stating that after a certain number of samples $n$, $\|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)}$ can be bounded by the parameter $c$ from our local self-concordance assumption.

**Lemma C.3.** Suppose Assumptions 3.1, 3.2, & 3.3 hold. There is an absolute constant $C'$ such that after round $n$ in Phase I of Algorithm 1, where $n$ satisfies $n \ge 2C'd_wF^2\iota\max\big\{\mu^{\gamma/(2-\gamma)}\tau^{-2/(2-\gamma)},\ \zeta\mu^{-1}c^{-2}\big\}$, with probability $> 1-\delta/2$, $\|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)} \le c$.

*Proof.* First, we prove that when $n$ satisfies the first condition, $\|\hat{w}_0 - w^*\|_2 \le (\tau/\mu)^{1/(2-\gamma)}$, by a proof by contradiction.
Assume $\|\hat{w}_0 - w^*\|_2 > (\tau/\mu)^{1/(2-\gamma)}$. Check that under this condition we have $\frac{\tau}{2}\|\hat{w}_0 - w^*\|_2^\gamma < \frac{\mu}{2}\|\hat{w}_0 - w^*\|_2^2$, so the growth-condition part (rather than the local strong convexity part) of Assumption 3.3 is active. By the $(\tau,\gamma)$-growth condition, we have

$$\frac{\tau}{2}\|\hat{w}_0 - w^*\|_2^\gamma \le L(\hat{w}_0) - L(w^*) \le \frac{C'd_wF^2\iota}{n}.$$

Substituting the first lower bound on $n$ from the assumption, we get $\|\hat{w}_0 - w^*\|_2 \le (\tau/\mu)^{1/(2-\gamma)}$, thus obtaining a contradiction. This proves that when $n$ satisfies the first condition, $\hat{w}_0$ is within the region where local strong convexity is active. By the local strong convexity condition,

$$\frac{\mu}{2}\|\hat{w}_0 - w^*\|_2^2 \le L(\hat{w}_0) - L(w^*) \le \frac{C'd_wF^2\iota}{n} \quad\Longrightarrow\quad \|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)} \le \sqrt{\frac{2\zeta C'd_wF^2\iota}{\mu n}},$$

using $\|\nabla^2L(w^*)\|_{op} \le \zeta$. Substituting the second lower bound on $n$ that we assumed, we get $\|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)} \le c$.

Now we continue the proof of Theorem 5.2. Observe that $\|\bar{w} - w^*\|_{\nabla^2L(w^*)} \le \|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)} \le c$, since $\bar{w}$ lies on the line segment between $\hat{w}_0$ and $w^*$. It follows from the $c$-local self-concordance assumption (Assumption 3.3) that

$$(1-c)^2\|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)}^2 \le \|\hat{w}_0 - w^*\|_{\nabla^2L(\bar{w})}^2.$$

Therefore, by eq. (11),

$$\|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)}^2 \le \frac{2C'd_wF^2\iota}{(1-c)^2\,n}.$$

The proof completes by the inequality $\|\hat{w}_0 - w^*\|_2^2 \le \|\hat{w}_0 - w^*\|_{\nabla^2L(w^*)}^2/\mu$, which holds due to the $\mu$-strong convexity of $L(w)$ at $w^*$ (Assumption 3.3), and by defining $C = 2C'/(1-c)^2$.

### C.2. Properties of the Covariance Matrix Σt

In eq. (3), $\Sigma_t$ is defined as $\lambda I + \sum_{i=0}^{t-1}\nabla f_{x_i}(\hat{w}_i)\nabla f_{x_i}(\hat{w}_i)^\top$. In this section, we prove three lemmas showing that the change of $\Sigma_t$ as $t = 1, ..., T$ is bounded in Phase II of GO-UCB. The key observation is that at each round $i$, the change made to $\Sigma_t$ is $\nabla f_{x_i}(\hat{w}_i)\nabla f_{x_i}(\hat{w}_i)^\top$, which is only rank one.
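Before the formal lemmas, a small NumPy check (ours) of the rank-one determinant identity $\det(\Sigma + gg^\top) = \det(\Sigma)(1 + g^\top\Sigma^{-1}g)$ that underlies Lemma C.4 below:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 4, 1.0
Sigma = lam * np.eye(d)                  # Sigma_0 = lambda * I
for _ in range(5):                       # five rank-one updates
    g = rng.standard_normal(d)           # stands in for grad f_{x_i}(w_i)
    u_sq = g @ np.linalg.inv(Sigma) @ g  # u_i^2 = ||g||^2 in the Sigma^{-1} norm
    # matrix determinant lemma: det(Sigma + g g^T) = det(Sigma) * (1 + u_i^2)
    assert np.isclose(np.linalg.det(Sigma + np.outer(g, g)),
                      np.linalg.det(Sigma) * (1 + u_sq))
    Sigma += np.outer(g, g)
print("determinant identity verified for 5 rank-one updates")
```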
Global Optimization with Parametric Function Approximation Proof. First, we prove i {0, 1, ..., t 1}, 0 < fxi( ˆwi) Σ 1 t fxi( ˆwi) < 1. Recall the definition of Σt, it s easy to see that Σt is a positive definite matrix and thus 0 < fxi( ˆwi) Σ 1 t fxi( ˆwi). To prove it s smaller than 1, we need to decompose Σt and write fxi( ˆwi) Σ 1 t fxi( ˆwi) = fxi( ˆwi) i=0 fxi( ˆwi) fxi( ˆwi) ! 1 = fxi( ˆwi) fxi( ˆwi) fxi( ˆwi) fxi( ˆwi) fxi( ˆwi) + λI + i=0 fxi( ˆwi) fxi( ˆwi) ! 1 Let A = fxi( ˆwi) fxi( ˆwi) + λI + Pt 1 i=0 fxi( ˆwi) fxi( ˆwi) , and it becomes fxi( ˆwi) Σ 1 t fxi( ˆwi) = fxi( ˆwi) ( fxi( ˆwi) fxi( ˆwi) + A) 1 fxi( ˆwi). By applying Sherman-Morrison lemma (Lemma B.2), we have fxi( ˆwi) Σ 1 t fxi( ˆwi) = fxi( ˆwi) A 1 A 1 fxi( ˆwi) fxi( ˆwi) A 1 1 + fxi( ˆwi) A 1 fxi( ˆwi) = fxi( ˆwi) A 1 fxi( ˆwi) fxi( ˆwi) A 1 fxi( ˆwi) fxi( ˆwi) A 1 fxi( ˆwi) 1 + fxi( ˆwi) A 1 fxi( ˆwi) = fxi( ˆwi) A 1 fxi( ˆwi) 1 + fxi( ˆwi) A 1 fxi( ˆwi) < 1. Next, we use the fact that x (0, 1), x 2 log(1 + x), and we have i=0 fxi( ˆwi) Σ 1 t fxi( ˆwi) i=0 2 log 1 + fxi( ˆwi) Σ 1 t fxi( ˆwi) 2 log det Σt 1 1 + t C2 g dwλ where the last two inequalities are due to Lemma C.4 and C.5. C.3. Feasibility of Ballt Lemma C.7 (Restatement of Lemma 5.3). Set Σt, ˆwt as in eq. (3), (5). Set βt as βt = O dwσ2 + d3 w µ2 + d3 wt µ2T Suppose Assumption 3.1, 3.2, & 3.3 hold and choose n = T. Then t [T] in Phase II of Algorithm 1, w.p. > 1 δ, ˆwt w 2 Σt βt. Proof. The proof has three steps. First we obtain the closed form solution of ˆwt. Next we derive the upper bound of ˆwi w 2 2. Finally we use it to prove that the upper bound of ˆwt w 2 Σt matches our choice of βt. Step 1: Closed form solution of ˆwt. The optimal criterion for the objective function in eq. (4) is 0 = λ( ˆwt ˆw0) + i=0 (( ˆwt ˆwi) fxi( ˆwi) + fxi( ˆwi) yi) fxi( ˆwi). Global Optimization with Parametric Function Approximation Rearrange the equation and we have λ( ˆwt ˆw0) + i=0 ( ˆwt ˆwi) fxi( ˆwi) fxi( ˆwi) = i=0 (yi fxi( ˆwi)) fxi( ˆwi), λ( ˆwt ˆw0) + i=0 ( ˆwt ˆwi) fxi( ˆwi) fxi( ˆwi) = i=0 (yi fxi(w ) + fxi(w ) fxi( ˆwi)) fxi( ˆwi), λ( ˆwt ˆw0) + i=0 ˆw t fxi( ˆwi) fxi( ˆwi) = i=0 ( ˆw i fxi( ˆwi) + ηi + fxi(w ) fxi( ˆwi)) fxi( ˆwi), i=1 fxi( ˆwi) fxi( ˆwi) ! i=0 ( ˆw i fxi( ˆwi) + ηi + fxi(w ) fxi( ˆwi)) fxi( ˆwi), ˆwtΣt = λ ˆw0 + i=0 ( ˆw i fxi( ˆwi) + ηi + fxi(w ) fxi( ˆwi)) fxi( ˆwi), where the second line is by removing and adding back fxi(w ), the third line is due to definition of observation noise η and the last line is by our choice of Σt (eq. (3)). Now we have the closed form solution of ˆwt: ˆwt = Σ 1 t i=0 ( ˆw i fxi( ˆwi) + ηi + fxi(w ) fxi( ˆwi)) fxi( ˆwi) Further, ˆwt w can be written as ˆwt w = Σ 1 t i=0 fxi( ˆwi)( fxi( ˆwi) ˆwi + ηi + fxi(w ) fxi( ˆwi)) + λΣ 1 t ˆw0 Σ 1 t Σtw i=0 fxi( ˆwi)( fxi( ˆwi) ˆwi + ηi + fxi(w ) fxi( ˆwi)) + λΣ 1 t ( ˆw0 w ) i=0 fxi( ˆwi) fxi( ˆwi) ! i=0 fxi( ˆwi)( fxi( ˆwi) ( ˆwi w ) + ηi + fxi(w ) fxi( ˆwi)) + λΣ 1 t ( ˆw0 w ) i=0 fxi( ˆwi)1 2 w ˆwi 2 2fxi( w) i=0 fxi( ˆwi)ηi + λΣ 1 t ( ˆw0 w ), (13) where the second line is again by our choice of Σt and the last equation is by the second order Taylor s theorem of fxi(w ) at ˆwi where w lies between w and ˆwi. Step 2: Upper bound of ˆwi w 2 2. Note eq. (13) holds i [T] because all ˆwi are obtained through the same optimization problem, which means ˆwi w = Σ 1 i ρ=0 fxρ( ˆwρ)1 2 w ˆwρ 2 2fxρ( w) ρ=0 fxρ( ˆwρ)ηρ + λΣ 1 i ( ˆw0 w ). 
Step 2: Upper bound of $\|\hat w_i - w^*\|_2^2$. Note that eq. (13) holds for all $i \in [T]$ because every $\hat w_i$ is obtained through the same optimization problem, which means
$$\hat w_i - w^* = \Sigma_i^{-1}\Big(\sum_{\rho=0}^{i-1}\nabla f_{x_\rho}(\hat w_\rho)\,\tfrac12\|w^* - \hat w_\rho\|^2_{\nabla^2f_{x_\rho}(\tilde w_\rho)} + \sum_{\rho=0}^{i-1}\nabla f_{x_\rho}(\hat w_\rho)\,\eta_\rho\Big) + \lambda\Sigma_i^{-1}(\hat w_0 - w^*).$$
By the inequality $(a+b+c)^2 \le 4a^2 + 4b^2 + 4c^2$ and the definition of $\Sigma_i$, we take the square of both sides and get
$$\|\hat w_i - w^*\|_2^2 \le \frac{4}{\lambda}\Big\|\sum_{\rho=0}^{i-1}\nabla f_{x_\rho}(\hat w_\rho)\eta_\rho\Big\|^2_{\Sigma_i^{-1}} + 4\|\hat w_0 - w^*\|_2^2 + \frac{4}{\lambda}\Big\|\sum_{\rho=0}^{i-1}\nabla f_{x_\rho}(\hat w_\rho)\,\tfrac12\|w^* - \hat w_\rho\|^2_{\nabla^2f_{x_\rho}(\tilde w_\rho)}\Big\|^2_{\Sigma_i^{-1}}. \qquad (14)$$
Now we use induction to prove the convergence rate of $\|\hat w_i - w^*\|_2^2$ for all $i \in [T]$. Recall that at the very beginning of Phase II, by Theorem 5.2 (check that the condition on $n$ is satisfied due to our condition on $T$ and the choice $n = \sqrt{T}$), with probability $> 1-\delta/2$,
$$\|\hat w_0 - w^*\|_2^2 \le \frac{Cd_wF^2\iota}{\mu n}.$$
To set up the induction, we suppose that at round $i$ there exists some universal constant $\bar C$ such that with probability $> 1-\delta/2$,
$$\|\hat w_i - w^*\|_2^2 \le \frac{\bar Cd_wF^2\iota}{\mu n},$$
and our task is to prove that at round $i+1$, with probability $> 1-\delta/2$,
$$\|\hat w_{i+1} - w^*\|_2^2 \le \frac{\bar Cd_wF^2\iota}{\mu n}.$$
Note $\bar C$ is introduced for the induction and can be different from $C$. From eq. (14), at round $i+1$ we can write
$$\|\hat w_{i+1} - w^*\|_2^2 \le \frac{4\sigma^2}{\lambda}\log\Big(\frac{\det(\Sigma_i)}{\det(\Sigma_0)}\cdot\frac{1}{\delta_i^2}\Big) + \frac{4Cd_wF^2\iota}{\mu n} + \frac{4}{\lambda}\Big\|\sum_{\rho=0}^{i-1}\nabla f_{x_\rho}(\hat w_\rho)\tfrac12\|w^* - \hat w_\rho\|^2_{\nabla^2f_{x_\rho}(\tilde w_\rho)}\Big\|^2_{\Sigma_{i+1}^{-1}}$$
$$\le \frac{4d_w\sigma^2\iota}{\lambda} + \frac{4Cd_wF^2\iota}{\mu n} + \frac{4}{\lambda}\Big\|\sum_{\rho=0}^{i-1}\nabla f_{x_\rho}(\hat w_\rho)\tfrac12\|w^* - \hat w_\rho\|^2_{\nabla^2f_{x_\rho}(\tilde w_\rho)}\Big\|^2_{\Sigma_{i+1}^{-1}},$$
where the first inequality is due to the self-normalized bound for vector-valued martingales (Lemma B.3 in Appendix B) and Theorem 5.2, and the second inequality is by Lemma C.5 and our choice of $\delta_i = 3\delta/(\pi^2i^2)$, defining $\iota$ as the logarithmic term depending on $i, d_w, C_g, 1/\lambda, 2/\delta$ (with probability $> 1-\delta/2$). The choice of $\delta_i$ guarantees that the total failure probability over $t$ rounds is no larger than $\delta/2$. Now we use the inductive hypothesis $\|\hat w_i - w^*\|_2^2 \le \bar Cd_wF^2\iota/(\mu n)$ to bound the last term:
$$\|\hat w_{i+1} - w^*\|_2^2 \le \frac{4d_w\sigma^2\iota}{\lambda} + \frac{4Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^2F^4\iota^2}{\lambda\mu^2n^2}\Big(\sum_{\rho=0}^{i-1}\|\nabla f_{x_\rho}(\hat w_\rho)\|_{\Sigma_{i+1}^{-1}}\Big)^2$$
$$\le \frac{4d_w\sigma^2\iota}{\lambda} + \frac{4Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^2F^4\iota^2}{\lambda\mu^2n^2}\,i\sum_{\rho=0}^{i-1}\|\nabla f_{x_\rho}(\hat w_\rho)\|^2_{\Sigma_{i+1}^{-1}} \le \frac{4d_w\sigma^2\iota}{\lambda} + \frac{4Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^3F^4\,i\,\tilde\iota\,\iota^2}{\lambda\mu^2n^2},$$
where the first inequality is due to the smoothness of the loss function in Assumption 3.3 and the triangle inequality, the second inequality is by the Cauchy-Schwarz inequality, and the last inequality follows from Lemma C.6, defining $\tilde\iota$ as the logarithmic term depending on $i, d_w, C_g, 1/\lambda$. What we need is that there exists some universal constant $\bar C$ such that
$$\frac{4d_w\sigma^2\iota}{\lambda} + \frac{4Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^3F^4\,i\,\tilde\iota\,\iota^2}{\lambda\mu^2n^2} \le \frac{\bar Cd_wF^2\iota}{\mu n}.$$
The LHS is monotonically increasing w.r.t. $i$, so it suffices that the inequality holds at $i = T$, i.e.,
$$\frac{4d_w\sigma^2\iota}{\lambda} + \frac{4Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^3F^4T\tilde\iota\iota^2}{\lambda\mu^2n^2} \le \frac{\bar Cd_wF^2\iota}{\mu n}.$$
Recall that the range of our function is $[-F, F]$, so for any distribution the variance $\sigma^2$ can always be upper bounded by $F^2/4$; multiplying through, we just need to show that
$$\mu^2n^2\iota + 4\lambda\mu n\bar C\iota + \bar C^2C_h^2d_w^2F^2T\iota^2\tilde\iota \le \bar C\lambda\mu n\iota,$$
$$\bar C^2C_h^2d_w^2F^2T\iota^2\tilde\iota - \bar C\lambda\mu n\iota + \mu^2n^2\iota + 4\lambda\mu n\bar C\iota \le 0,$$
where the second and third lines are by rearrangement. A feasible solution in $\bar C$ requires the discriminant of this quadratic to be non-negative:
$$\lambda^2\mu^2n^2\iota^2 - 4C_h^2d_w^2F^2T\iota^2\tilde\iota\,(\mu^2n^2\iota + 4\lambda\mu n\bar C\iota) \ge 0,$$
$$\lambda^2\mu^2n - 4C_h^2d_w^2F^2T\tilde\iota\,(\mu^2n\iota + 4\lambda\mu\bar C\iota) \ge 0, \qquad (15)$$
where the second line is by rearrangement. Substituting our choice $\lambda = C_\lambda\sqrt{T}$ and solving the quadratic inequality for $C_\lambda$, we get that it suffices to choose
$$C_\lambda \ge \frac{C_h^2d_w^2F^2\iota\tilde\iota + \sqrt{16\bar C^2C_h^4d_w^4F^4\iota^2\tilde\iota^2}}{\mu^2} = O\Big(\frac{d_w^2}{\mu}\Big), \qquad (16)$$
under the assumption $d_w > \mu$. Check that $C_\lambda$ depends only logarithmically on $T$ and that it ensures eq. (15) holds, therefore certifying that a universal constant $\bar C$ exists. Therefore, by induction, we have proven that for all $i \in [T]$ there exists a universal constant $\bar C$ such that with probability $> 1-\delta/2$,
$$\|\hat w_i - w^*\|_2^2 \le \frac{\bar Cd_wF^2\iota}{\mu n}.$$
With this result, we are now ready to move to Step 3.
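Before Step 3, the induction just completed can be summarized compactly. The following display is our paraphrase of the argument above, with the three error terms labeled by their origin; it is a restatement, not an additional result.

```latex
% Compact schema of the Step 2 induction (paraphrase of the proof above).
\begin{align*}
\text{Base (Thm. 5.2):}\quad
  & \|\hat w_0 - w^*\|_2^2 \le \frac{C d_w F^2 \iota}{\mu n},\\
\text{Hypothesis at round } i:\quad
  & \|\hat w_i - w^*\|_2^2 \le \frac{\bar C d_w F^2 \iota}{\mu n},\\
\text{Step to round } i+1:\quad
  & \|\hat w_{i+1} - w^*\|_2^2
    \le \underbrace{\frac{4 d_w \sigma^2 \iota}{\lambda}}_{\text{noise}}
      + \underbrace{\frac{4 C d_w F^2 \iota}{\mu n}}_{\text{Phase I error}}
      + \underbrace{\frac{\bar C^2 C_h^2 d_w^3 F^4\, i\, \tilde\iota\, \iota^2}
                   {\lambda \mu^2 n^2}}_{\text{linearization error}}
    \le \frac{\bar C d_w F^2 \iota}{\mu n},
\end{align*}
% where the closing inequality holds for every i <= T once
% lambda = C_lambda sqrt(T) and n = sqrt(T), with C_lambda as in eq. (16).
```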
Step 3: Upper bound of $\|\hat w_t - w^*\|^2_{\Sigma_t}$. Multiplying both sides of eq. (13) by $\Sigma_t^{\frac12}$, we have
$$\Sigma_t^{\frac12}(\hat w_t - w^*) = \Sigma_t^{-\frac12}\Big(\sum_{i=0}^{t-1}\nabla f_{x_i}(\hat w_i)\,\tfrac12\|w^* - \hat w_i\|^2_{\nabla^2f_{x_i}(\tilde w_i)} + \sum_{i=0}^{t-1}\nabla f_{x_i}(\hat w_i)\,\eta_i\Big) + \lambda\Sigma_t^{-\frac12}(\hat w_0 - w^*).$$
Taking the square of both sides and using the inequality $(a+b+c)^2 \le 4a^2 + 4b^2 + 4c^2$, we obtain
$$\|\hat w_t - w^*\|^2_{\Sigma_t} \le 4\Big\|\sum_{i=0}^{t-1}\nabla f_{x_i}(\hat w_i)\eta_i\Big\|^2_{\Sigma_t^{-1}} + 4\lambda^2\|\hat w_0 - w^*\|^2_{\Sigma_t^{-1}} + 4\Big\|\sum_{i=0}^{t-1}\nabla f_{x_i}(\hat w_i)\,\tfrac12\|w^* - \hat w_i\|^2_{\nabla^2f_{x_i}(\tilde w_i)}\Big\|^2_{\Sigma_t^{-1}}.$$
The remaining proof closely follows Step 2, i.e.,
$$\|\hat w_t - w^*\|^2_{\Sigma_t} \le 4d_w\sigma^2\iota + \frac{4\lambda Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^2F^4\iota^2}{\mu^2n^2}\,t\sum_{i=0}^{t-1}\|\nabla f_{x_i}(\hat w_i)\|^2_{\Sigma_t^{-1}} \le 4d_w\sigma^2\iota + \frac{4\lambda Cd_wF^2\iota}{\mu n} + \frac{\bar C^2C_h^2d_w^3F^4t\,\tilde\iota\,\iota^2}{\mu^2n^2} = O\Big(d_w\sigma^2 + \frac{d_w^3}{\mu^2} + \frac{d_w^3t}{\mu^2T}\Big),$$
where the last equality is by our choices of $\lambda = C_\lambda\sqrt{T}$ and $n = \sqrt{T}$. Therefore, our choice of
$$\beta_t = O\Big(d_w\sigma^2 + \frac{d_w^3}{\mu^2} + \frac{d_w^3t}{\mu^2T}\Big)$$
guarantees that $w^*$ is always contained in $\mathrm{Ball}_t$ with probability $1-\delta$.

C.4. Regret Analysis

Lemma C.8 (Restatement of Lemma 5.4). Set $\Sigma_t, \hat w_t, \beta_t$ as in eq. (3), (5), & (7) and suppose Assumptions 3.1, 3.2, & 3.3 hold; then with probability $> 1-\delta$, $w^*$ is contained in $\mathrm{Ball}_t$. Define $u_t = \|\nabla f_{x_t}(\hat w_t)\|_{\Sigma_t^{-1}}$. Then for all $t \in [T]$ in Phase II of Algorithm 1,
$$r_t \le 2\sqrt{\beta_t}\,u_t + \frac{2\beta_tC_h}{\lambda}.$$

Proof. By the definition of the instantaneous regret $r_t$, $r_t = f_{x^*}(w^*) - f_{x_t}(w^*)$. Recall the selection process of $x_t$ and define $\tilde w = \mathrm{argmax}_{w\in\mathrm{Ball}_t}f_{x_t}(w)$; then
$$r_t \le f_{x_t}(\tilde w) - f_{x_t}(w^*) = (\tilde w - w^*)^\top\nabla f_{x_t}(\bar w),$$
where the equation is by the first-order Taylor theorem and $\bar w$ lies between $\tilde w$ and $w^*$, which means $\bar w$ is guaranteed to be in $\mathrm{Ball}_t$ since $\mathrm{Ball}_t$ is convex. Then, by adding and removing terms,
$$r_t = (\tilde w - \hat w_t + \hat w_t - w^*)^\top\big(\nabla f_{x_t}(\hat w_t) - \nabla f_{x_t}(\hat w_t) + \nabla f_{x_t}(\bar w)\big)$$
$$\le \|\tilde w - \hat w_t\|_{\Sigma_t}\|\nabla f_{x_t}(\hat w_t)\|_{\Sigma_t^{-1}} + \|\hat w_t - w^*\|_{\Sigma_t}\|\nabla f_{x_t}(\hat w_t)\|_{\Sigma_t^{-1}} + (\tilde w - \hat w_t)^\top\big(\nabla f_{x_t}(\bar w) - \nabla f_{x_t}(\hat w_t)\big) + (\hat w_t - w^*)^\top\big(\nabla f_{x_t}(\bar w) - \nabla f_{x_t}(\hat w_t)\big),$$
where the inequality is due to Hölder's inequality. By the definition of $\beta_t$ in $\mathrm{Ball}_t$ and $u_t = \|\nabla f_{x_t}(\hat w_t)\|_{\Sigma_t^{-1}}$,
$$r_t \le 2\sqrt{\beta_t}\,u_t + (\tilde w - \hat w_t)^\top\big(\nabla f_{x_t}(\bar w) - \nabla f_{x_t}(\hat w_t)\big) + (\hat w_t - w^*)^\top\big(\nabla f_{x_t}(\bar w) - \nabla f_{x_t}(\hat w_t)\big).$$
Again by the first-order Taylor theorem, $\nabla f_{x_t}(\bar w) - \nabla f_{x_t}(\hat w_t) = \nabla^2f_{x_t}(\check w)(\bar w - \hat w_t)$, where $\check w$ lies between $\bar w$ and $\hat w_t$ and thus lies in $\mathrm{Ball}_t$. Hence
$$r_t \le 2\sqrt{\beta_t}u_t + (\tilde w - \hat w_t)^\top\Sigma_t^{\frac12}\,\Sigma_t^{-\frac12}\nabla^2f_{x_t}(\check w)\Sigma_t^{-\frac12}\,\Sigma_t^{\frac12}(\bar w - \hat w_t) + (\hat w_t - w^*)^\top\Sigma_t^{\frac12}\,\Sigma_t^{-\frac12}\nabla^2f_{x_t}(\check w)\Sigma_t^{-\frac12}\,\Sigma_t^{\frac12}(\bar w - \hat w_t)$$
$$\le 2\sqrt{\beta_t}u_t + \|\tilde w - \hat w_t\|_{\Sigma_t}\,\big\|\Sigma_t^{-\frac12}\nabla^2f_{x_t}(\check w)\Sigma_t^{-\frac12}\big\|_2\,\|\bar w - \hat w_t\|_{\Sigma_t} + \|\hat w_t - w^*\|_{\Sigma_t}\,\big\|\Sigma_t^{-\frac12}\nabla^2f_{x_t}(\check w)\Sigma_t^{-\frac12}\big\|_2\,\|\bar w - \hat w_t\|_{\Sigma_t} \le 2\sqrt{\beta_t}u_t + \frac{2\beta_tC_h}{\lambda},$$
where the second inequality is by Hölder's inequality and the last inequality is due to the definition of $\beta_t$ in $\mathrm{Ball}_t$, Assumption 3.2, and our choice of $\Sigma_t$.

Lemma C.9 (Restatement of Lemma 5.5). Set $\Sigma_t, \hat w_t, \beta_t$ as in eq. (3), (5), & (7) and suppose Assumptions 3.1, 3.2, & 3.3 hold; then with probability $> 1-\delta$, $w^*$ is contained in $\mathrm{Ball}_t$ and for all $t \in [T]$ in Phase II of Algorithm 1,
$$\sum_{t=1}^{T}r_t^2 \le 16\beta_Td_w\log\Big(1 + \frac{TC_g^2}{d_w\lambda}\Big) + \frac{8\beta_T^2C_h^2T}{\lambda^2}.$$

Proof. By Lemma 5.4 and the inequality $(a+b)^2 \le 2a^2 + 2b^2$,
$$\sum_{t=1}^{T}r_t^2 \le \sum_{t=1}^{T}\Big(8\beta_tu_t^2 + \frac{8\beta_t^2C_h^2}{\lambda^2}\Big) \le 8\beta_T\sum_{t=1}^{T}u_t^2 + \frac{8\beta_T^2C_h^2T}{\lambda^2} \le 16\beta_Td_w\log\Big(1 + \frac{TC_g^2}{d_w\lambda}\Big) + \frac{8\beta_T^2C_h^2T}{\lambda^2},$$
where the second inequality is because $\beta_t$ is increasing in $t$ and the last inequality is by Lemma C.6.

By putting everything together, we are ready to prove the main cumulative regret theorem.

Proof of Theorem 4.1. By the definition of cumulative regret, including both Phase I and Phase II,
$$R_T = \sum_{t=1}^{n}r_t + \sum_{t=1}^{T}r_t \le 2Fn + \sqrt{T\sum_{t=1}^{T}r_t^2} \le 2F\sqrt{T} + \sqrt{16T\beta_Td_w\log\Big(1 + \frac{TC_g^2}{d_w\lambda}\Big) + \frac{8T^2\beta_T^2C_h^2}{\lambda^2}} = \tilde O\Big(\sqrt{T\beta_Td_w} + \sqrt{\frac{T^2\beta_T^2}{\lambda^2}}\Big),$$
where the first inequality is due to the function range and the Cauchy-Schwarz inequality, the second inequality is by Lemma 5.5 and $n = \sqrt{T}$, and the last step sets $\lambda = C_\lambda\sqrt{T}$ as required by Lemma 5.3, where $C_\lambda$ is in eq. (16). Recall that $\beta_t$ is defined in eq. (7), so $\beta_T = O\big(\frac{d_w^3}{\mu^2}\big)$. The proof completes by plugging in the upper bound of $\beta_T$.
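To see the $\tilde O(\sqrt{T})$ rate concretely, one can evaluate the right-hand side of the bound above numerically. In the short Python script below, all constants and the helper function are arbitrary illustrative placeholders; the point is only that the bound divided by $\sqrt{T}$ grows logarithmically in $T$.

```python
import numpy as np

# Hypothetical constants, chosen only to visualize the scaling of the bound.
d_w, mu, F, C_g, C_h, C_lam, sigma = 10, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1

def regret_bound(T):
    lam = C_lam * np.sqrt(T)                        # lambda = C_lambda * sqrt(T)
    n = np.sqrt(T)                                  # Phase I length n = sqrt(T)
    beta_T = d_w * sigma**2 + 2 * d_w**3 / mu**2    # beta_t at t = T
    phase1 = 2 * F * n                              # each Phase I round costs at most 2F
    phase2 = np.sqrt(16 * T * beta_T * d_w * np.log(1 + T * C_g**2 / (d_w * lam))
                     + 8 * T**2 * beta_T**2 * C_h**2 / lam**2)
    return phase1 + phase2

for T in [10**3, 10**4, 10**5, 10**6]:
    print(f"T={T:>8d}  bound/sqrt(T) = {regret_bound(T) / np.sqrt(T):.1f}")
# The ratio grows only logarithmically in T, i.e., R_T = O~(sqrt(T)).
```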
D. Additional Experimental Details

In addition to the Experiments section in the main paper, in this section we give details of the algorithm implementation and the real-world experiments.

D.1. Implementation of GO-UCB

The noise parameter is $\sigma = 0.01$. The regression oracle in GO-UCB is approximated by the stochastic gradient descent algorithm on our two-linear-layer neural network model with mean squared error loss, 2000 iterations, and a $10^{-11}$ learning rate. Exactly solving the optimization problem in Step 5 of Phase II may not be computationally tractable, so we use an iterative gradient ascent algorithm over $x$ and $w$ with 2000 iterations and a $10^{-4}$ learning rate (a minimal sketch of this gradient-ascent step is given at the end of this appendix). $\beta_t$ is set as $d_w^3F^4t/T$, and $\lambda$ is set following the theoretical choice $\lambda = C_\lambda\sqrt{T}$.

D.2. Real-world Experiments

Hyperparameters can be continuous or categorical; however, in order to fairly compare GO-UCB with Bayesian optimization methods, in all hyperparameter tuning tasks we set the function domain to be $[0, 10]^{d_x}$, a continuous domain. If a hyperparameter is categorical, we allocate an equal-length subinterval of the domain to each category. For example, the seventh hyperparameter of random forest is a boolean value, True or False, and we define $[0, 5)$ as True and $[5, 10]$ as False. If a hyperparameter is continuous, we use a linear mapping from the hyperparameter's range to $[0, 10]$. For example, the sixth hyperparameter of multi-layer perceptron is a float value in $(0, 1)$, so we multiply it by 10 to map it to $(0, 10)$. (A small sketch of this decoding is also given at the end of this appendix.)

Hyperparameters in hyperparameter tuning tasks. We list the hyperparameters of all three tasks as follows.

Classification with Random Forest.
1. Number of trees in the forest (integer, [20, 200]).
2. Criterion (string, "gini", "entropy", or "log_loss").
3. Maximum depth of the tree (integer, [1, 10]).
4. Minimum number of samples required to split an internal node (integer, [2, 10]).
5. Minimum number of samples required to be at a leaf node (integer, [1, 10]).
6. Maximum number of features to consider when looking for the best split (string, "sqrt" or "log2").
7. Bootstrap (bool, True or False).

Classification with Multi-Layer Perceptron.
1. Activation function (string, "identity", "logistic", "tanh", or "relu").
2. Strength of the L2 regularization term (float, [10^{-6}, 10^{-2}]).
3. Initial learning rate (float, [10^{-6}, 10^{-2}]).
4. Maximum number of iterations (integer, [100, 300]).
5. Whether to shuffle samples in each iteration (bool, True or False).
6. Exponential decay rate for estimates of the first moment vector (float, (0, 1)).
7. Exponential decay rate for estimates of the second moment vector (float, (0, 1)).
8. Maximum number of epochs without meeting the tolerance improvement (integer, [1, 10]).

Classification with Gradient Boosting.
1. Loss (string, "log_loss" or "exponential").
2. Learning rate (float, (0, 1)).
3. Number of estimators (integer, [20, 200]).
4. Fraction of samples to be used for fitting the individual base learners (float, (0, 1)).
5. Function to measure the quality of a split (string, "friedman_mse" or "squared_error").
6. Minimum number of samples required to split an internal node (integer, [2, 10]).
7. Minimum number of samples required to be at a leaf node (integer, [1, 10]).
8. Minimum weighted fraction of the sum total of weights (float, (0, 0.5)).
9. Maximum depth of the individual regression estimators (integer, [1, 10]).
10. Number of features to consider when looking for the best split (string, "sqrt" or "log2").
11. Maximum number of leaf nodes, grown in best-first fashion (integer, [2, 10]).

[Figure 4: Cumulative regrets (the lower the better) of all algorithms in the real-world hyperparameter tuning task on the Australian dataset. Panels: (a) Random forest ($d_x = 7$), (b) Multi-layer perceptron ($d_x = 8$), (c) Gradient boosting ($d_x = 11$). Each panel plots cumulative regret against the time horizon for GP-UCB, GP-EI, GP-PI, TuRBO, and GO-UCB.]

[Figure 5: Cumulative regrets (the lower the better) of all algorithms in the real-world hyperparameter tuning task on the Diabetes dataset, with the same three panels and methods as Figure 4.]

Results on Australian and Diabetes datasets. Due to the page limit of the main paper, we report the experimental results of the hyperparameter tuning tasks on the Australian and Diabetes datasets in Figure 4 and Figure 5. Our proposed GO-UCB algorithm performs consistently better than all other algorithms, matching its behavior on the Breast-cancer dataset in the main paper.
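For concreteness, below is a minimal PyTorch sketch of the gradient-ascent approximation to Step 5 of Phase II described in D.1, i.e., ascending $f_w(x)$ jointly over $x$ and $w$ while keeping $w$ inside $\mathrm{Ball}_t$. The model architecture, tensor sizes, number of iterations shown, and the rescaling step used as a projection are simplified illustrations rather than the exact experimental code; in particular, $\Sigma_t$ is taken to be the identity here instead of being accumulated from gradients as in eq. (3).

```python
import torch

torch.manual_seed(0)
d_x, d_h = 8, 16                       # illustrative sizes
d_w = d_x * d_h + d_h

def f(x, w):
    # Hypothetical two-linear-layer model f_w(x); names are illustrative.
    W1 = w[: d_x * d_h].view(d_h, d_x)
    W2 = w[d_x * d_h:].view(1, d_h)
    return (W2 @ torch.tanh(W1 @ x.unsqueeze(-1))).squeeze()

w_hat = torch.randn(d_w) * 0.1         # center of Ball_t (online regression solution)
Sigma = torch.eye(d_w)                 # stand-in for Sigma_t from eq. (3)
beta_t = 1.0                           # ball radius parameter, eq. (7)

x = (10 * torch.rand(d_x)).requires_grad_()
w = w_hat.clone().requires_grad_()
lr = 1e-4                              # matches the 10^{-4} rate reported above
for _ in range(2000):
    val = f(x, w)
    gx, gw = torch.autograd.grad(val, (x, w))
    with torch.no_grad():
        x += lr * gx
        x.clamp_(0.0, 10.0)            # keep x in the domain [0, 10]^{d_x}
        w += lr * gw
        dev = w - w_hat                # rescale w back onto Ball_t if it escaped
        q = dev @ Sigma @ dev
        if q > beta_t:
            w.copy_(w_hat + dev * (beta_t / q).sqrt())
print("query point x_t:", x.detach())
```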
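Similarly, the decoding from the continuous domain $[0, 10]^{d_x}$ to concrete hyperparameter values described in D.2 can be sketched as follows; the function names are our illustrative choices.

```python
def decode_categorical(z, choices):
    """Split [0, 10] into equal-length bins, one bin per category."""
    idx = min(int(z / (10.0 / len(choices))), len(choices) - 1)
    return choices[idx]

def decode_continuous(z, lo, hi):
    """Linearly map [0, 10] onto the hyperparameter range [lo, hi]."""
    return lo + (hi - lo) * z / 10.0

# Examples: random forest's bootstrap flag and MLP's L2 strength.
print(decode_categorical(3.7, [True, False]))   # [0, 5) -> True, [5, 10] -> False
print(decode_continuous(3.7, 1e-6, 1e-2))       # continuous float hyperparameter
```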