# Sample-Then-Optimize Batch Neural Thompson Sampling

Zhongxiang Dai, Yao Shu, Bryan Kian Hsiang Low, Patrick Jaillet
Dept. of Computer Science, National University of Singapore, Republic of Singapore
Dept. of Electrical Engineering and Computer Science, MIT, USA
{daizhongxiang,shuyao,lowkh}@comp.nus.edu.sg, jaillet@mit.edu

Correspondence to: Yao Shu.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Bayesian optimization (BO), which uses a Gaussian process (GP) as a surrogate to model its objective function, is popular for black-box optimization. However, due to the limitations of GPs, BO underperforms in some problems, such as those with categorical, high-dimensional, or image inputs. To this end, recent works have used highly expressive neural networks (NNs) as the surrogate model and derived theoretical guarantees using the theory of the neural tangent kernel (NTK). However, these works suffer from two limitations: the requirement to invert an extremely large parameter matrix, and the restriction to the sequential (rather than batch) setting. To overcome these limitations, we introduce two algorithms based on the Thompson sampling (TS) policy, named Sample-Then-Optimize Batch Neural TS (STO-BNTS) and STO-BNTS-Linear. To choose an input query, we only need to train an NN (resp. a linear model) and then choose the query by maximizing the trained NN (resp. linear model), which is equivalently sampled from the GP posterior with the NTK as the kernel function. As a result, our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy. Next, we derive regret upper bounds for our algorithms with batch evaluations, and use insights from batch BO and the NTK to show that they are asymptotically no-regret under certain conditions. Finally, we verify their empirical effectiveness using practical AutoML and reinforcement learning experiments.

## 1 Introduction

Bayesian optimization (BO), also called Gaussian process (GP) bandits, has become a celebrated method for optimizing expensive-to-compute black-box functions, primarily thanks to its practical sample efficiency and theoretically guaranteed convergence [11, 14, 21, 51]. However, there are important problem settings where BO either underperforms or is not even applicable without sophisticated modifications, such as problems with categorical [18], high-dimensional [30], or image inputs. These issues arise mainly because GPs (i.e., the surrogate used by BO to model the objective function) are unable to effectively model these types of input spaces, which calls for the use of alternative surrogate models in BO. To this end, neural networks (NNs) serve as a natural candidate owing to their remarkable expressivity [37]. NNs have repeatedly proven their ability to model extremely complicated real-world functions, such as those involving categorical, high-dimensional, or image inputs, whereas the development of GPs that can effectively model such functions remains an active area of research. In this regard, [69] adopted NNs as the surrogate model in contextual bandit problems and employed the theoretical framework of the neural tangent kernel (NTK) [27] to construct a principled algorithm following the well-known policy of upper confidence bound (UCB), hence introducing the Neural UCB algorithm. More recently, [66] have extended Neural UCB to
follow the Thompson sampling (TS) policy and proposed the Neural TS algorithm. Both Neural UCB and Neural TS are equipped with upper bounds on their cumulative regret and perform competitively in real-world contextual bandit experiments. However, Neural UCB and Neural TS still face several important limitations that may hinder their practical application. Firstly, these algorithms suffer from the requirement to invert a $p \times p$ matrix in every iteration, where $p$ is the number of parameters of the NN surrogate and is usually extremely large since the theory of the NTK requires severe overparameterization. In practice, a diagonal approximation is used to avoid the need to invert such a large $p \times p$ matrix [66, 69], which introduces approximation errors in their practical deployment that are unaccounted for in the theoretical analysis, hence causing a disparity between theory and practice. Secondly, the theoretical analyses of these algorithms are restricted to the sequential setting and are hence not applicable in the batch setting where an entire batch of inputs is selected for querying.¹

To overcome these two limitations, we introduce two algorithms based on the TS policy, named Sample-Then-Optimize Batch Neural Thompson Sampling (STO-BNTS) and STO-BNTS-Linear, both of which (1) sidestep the need to invert the $p \times p$ matrix and hence close the gap between theory and practice, and (2) naturally support batch evaluations while preserving the theoretical guarantees. To avoid the inversion of the $p \times p$ matrix while still ensuring the validity of the TS policy, we draw inspiration from sample-then-optimize optimization [42] and Bayesian deep ensembles [24] to efficiently sample functions from the GP posterior (with the NTK as the kernel function) without inverting the $p \times p$ matrix. Specifically, to choose an input query within a batch, our STO-BNTS (resp. STO-BNTS-Linear) only needs to use the current observation history to train an NN surrogate (resp. a linear model defined w.r.t. the random-features embedding of the NTK associated with an NN surrogate) starting from randomly initialized parameters, and then choose the input that maximizes the trained NN (resp. trained linear model). As a result, if the NN surrogate has an infinite width, the function represented by the trained NN (resp. trained linear model) is equivalently sampled from the GP posterior with the NTK as the kernel function. This ensures that our algorithms follow the TS policy and hence lays the foundation for our theoretical analyses.

Next, to address the second challenge of deriving theoretical guarantees for our algorithms in the batch setting, we generalize the theoretical analysis of sequential TS [7] to account for batch evaluations and derive a regret upper bound for both of our algorithms when the NN surrogate is infinite-width. Then, we leverage insights from batch BO [17] and the NTK [31] to show that the regret upper bound is sub-linear (under some conditions), which implies that our algorithms are asymptotically no-regret. Next, when the NN surrogate is finite-width, we derive a regret upper bound for our STO-BNTS-Linear by carefully accounting for the approximation error caused by the use of a finite-width (instead of infinite-width) NN, and show that the regret upper bound of STO-BNTS-Linear remains sub-linear as long as the NN is wide enough.

¹The work of [23] focuses on the contextual bandit setting and aims to choose a batch of inputs given a batch of diverse contexts. So, their methods are not applicable in our setting of BO where the contexts are fixed in all iterations, since they do not explicitly encourage input diversity, which is a crucial problem in batch BO [8, 17].
Our contributions are summarized as follows:

- Our STO-BNTS and STO-BNTS-Linear algorithms sidestep the inversion of the $p \times p$ matrix required by Neural UCB and Neural TS, which closes their gap between theory and practice.
- Our algorithms naturally support batch evaluations with theoretical guarantees.
- Our algorithms are equipped with an upper bound on their cumulative regret when the NN surrogate is infinite-width, and are asymptotically no-regret (i.e., their regret upper bound is sub-linear) under certain conditions. Moreover, when the NN surrogate is finite-width, our STO-BNTS-Linear still enjoys a regret upper bound, which remains sub-linear as long as the NN is wide enough.
- We demonstrate our empirical effectiveness in real-world experiments including automated machine learning (AutoML) and reinforcement learning (RL) tasks, as well as a task on optimization over images. To the best of our knowledge, our experiments (Sec. 5) are the first empirical study to show the advantage of neural bandit over GP bandit algorithms in practical AutoML and RL tasks.

## 2 Background

**Neural Networks and Neural Tangent Kernel.** In this work, we adopt the same construction of a neural network (NN) as [2]. We use $f(x; \theta)$ to denote the scalar output of an $(L+1)$-layer NN with parameters $\theta \in \mathbb{R}^p$ and input $x$, and use $\nabla_\theta f(x; \theta')$ to represent the gradient of the NN evaluated at $\theta = \theta'$. For simplicity, we assume every layer of the NN has the same width, which we denote by $m$. We denote our initialization scheme for $\theta$ as $\mathrm{init}(\cdot)$, which simply samples every NN parameter independently from the standard Gaussian distribution. The NTK [27] provides an explicit connection between NNs trained via gradient descent and kernel regression with the NTK as the kernel function [2]. The NTK matrix, denoted as $\Theta$, has been shown to stay constant during the course of training as the width $m$ of the NN approaches infinity [27]. Moreover, $\Theta$ can be approximated by an empirical NTK $\tilde{\Theta}$ [2] calculated using a finite-width NN: $\tilde{\Theta}(x, x') \triangleq \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle \approx \Theta(x, x')$, where $\theta_0 \sim \mathrm{init}(\cdot)$ denotes the initial parameters and $\nabla_\theta f(x; \theta_0)$ is referred to as the neural tangent features [66]. We refer the readers to the works of [2, 27] for a more detailed background on the NTK.

**Problem Setting.** We aim to maximize a black-box function $f: \mathcal{X} \to \mathbb{R}$, i.e., to find $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$, in which the domain $\mathcal{X}$ is a finite subset of the $d$-dimensional unit ball: $\mathcal{X} \subset \{x \,|\, \|x\|_2 \le 1\}$. Of note, our theoretical results allow $\mathcal{X}$ to be very large because our regret upper bounds depend on its cardinality $|\mathcal{X}|$ only logarithmically. Moreover, all our theoretical results can be easily extended to problems with continuous input domains under an additional assumption on the Lipschitz continuity of $f$ (Appendix E). We focus on the noisy setting, i.e., for every queried $x$, we observe a noisy output $y(x) = f(x) + \zeta$ where $\zeta \sim \mathcal{N}(0, \sigma^2)$. For simplicity, we focus on the setting of synchronous batch BO with a batch size $B$, where a new batch of $B$ inputs is selected only after all evaluations of the previous batch are completed [17].
However, our theoretical results also hold for asynchronous batch BO where a new input query is selected once any pending query is completed (Appendix C). We denote the $i$th selected input in iteration $t$ as $x_t^i$. We analyze the cumulative regret of our algorithms, $R_T = \sum_{t=1}^{T/B} \sum_{i=1}^{B} \big(f(x^*) - f(x_t^i)\big)$, because if $R_T$ is shown to be sub-linear in $T$, then the simple regret $S_T = \min_{t,i}\big(f(x^*) - f(x_t^i)\big) \le R_T / T$ goes to $0$ asymptotically, which implies that our algorithm is asymptotically no-regret.

## 3 Sample-Then-Optimize Batch Neural Thompson Sampling

Our STO-BNTS and STO-BNTS-Linear algorithms are presented in Algos. 1 and 2. In both algorithms, the NN surrogate $f(x; \theta)$ can be either infinite-width or finite-width. Both STO-BNTS and STO-BNTS-Linear follow the TS policy to select an input query $x_t^i$: they firstly (a) obtain a function $f_t^i(x; \theta_t^i)$ which is equivalently sampled from the GP posterior with the NTK as the kernel, $\mathcal{GP}(\mu_{t-1}(\cdot), \beta_t^2 \sigma_{t-1}^2(\cdot, \cdot))$ [24] (see Appendix A for details), and then (b) maximize this function to select the next query: $x_t^i = \arg\max_{x \in \mathcal{X}} f_t^i(x; \theta_t^i)$. Step (a) is achieved via the sample-then-optimize procedure, i.e., by firstly sampling initial parameters ($\theta_0$ and $\theta_0'$) to construct a function $f_t^i(x; \theta)$, and then optimizing this function using gradient descent to obtain the resulting function $f_t^i(x; \theta_t^i)$.

**STO-BNTS (Algo. 1).** In every iteration $t$ of STO-BNTS, we firstly construct an NN $f(x; \theta)$ and multiply its output by $\beta_t = \sqrt{2 \log\big(\pi^2 t^2 |\mathcal{X}| / (3\delta)\big)}$, in which $\delta \in (0, 1)$ (Theorem 1).² Next, to choose the $i$th query $x_t^i$, we start by sampling initial parameters $\theta_0 \sim \mathrm{init}(\cdot)$ and $\theta_0' \sim \mathrm{init}(\cdot)$ independently, and then set the parameters of $\theta_0'$ in the last layer to $0$ (lines 4-5). Next, we use the resulting $\theta_0$ and $\theta_0'$, as well as the NN $f(x; \theta)$, to construct a function $f_t^i(x; \theta)$ (line 6). Subsequently, in line 7, using the current history of observations (denoted as $\mathcal{D}_{t-1}$) as the training set, we train $f_t^i(x; \theta)$ (setting $\theta_0$ as the initial parameters) using gradient descent with the following loss function:

$$\mathcal{L}_t(\theta, \mathcal{D}_{t-1}) = \sum_{(x_\tau^j,\, y_\tau^j) \in \mathcal{D}_{t-1}} \big(y_\tau^j - f_t^i(x_\tau^j; \theta)\big)^2 + \beta_t^2 \sigma^2 \|\theta - \theta_0\|_2^2, \tag{1}$$

in which $\sigma^2$ is the observation noise variance (Sec. 2). After the training, we use the resulting function $f_t^i(x; \theta_t^i)$ as the acquisition function to choose the $i$th query: $x_t^i = \arg\max_{x \in \mathcal{X}} f_t^i(x; \theta_t^i)$ (line 8). This procedure (lines 4-8) is repeated independently for another $B - 1$ times, after which a batch of $B$ queries $\{x_t^i\}_{i=1,\ldots,B}$ has been selected; these queries are then evaluated to produce the observations $\{y_t^i\}_{i=1,\ldots,B}$. Next, the newly collected input-output pairs $\{(x_t^i, y_t^i)\}_{i=1,\ldots,B}$ are added to $\mathcal{D}_{t-1}$ and the algorithm proceeds to the next iteration. Importantly, if the NN $f(x; \theta)$ is infinite-width, the function $f_t^i(x; \theta_t^i)$ obtained after the training in line 7 is a sample from the GP posterior with the NTK as the kernel, $\mathcal{GP}(\mu_{t-1}(\cdot), \beta_t^2 \sigma_{t-1}^2(\cdot, \cdot))$ [24] (Appendix A). This ensures the validity of the TS policy and is crucial for deriving the theoretical guarantee of STO-BNTS (Sec. 4).

²Note that $\beta_t$ is introduced only for the theoretical analysis, and hence we set $\beta_t = 1$ in our experiments.
**Algorithm 1** STO-BNTS
1: for $t = 1, 2, \ldots, T/B$ do
2:   Construct NN $f(x; \theta)$ and multiply its output by $\beta_t$ (Theorem 1)
3:   for $i = 1, 2, \ldots, B$ do
4:     Sample $\theta_0 \sim \mathrm{init}(\cdot)$
5:     Sample $\theta_0' \sim \mathrm{init}(\cdot)$ and set the parameters of $\theta_0'$ in the last layer to $0$
6:     Set $f_t^i(x; \theta) = f(x; \theta) + \langle \nabla_\theta f(x; \theta_0), \theta_0' \rangle$
7:     Use observation history $\mathcal{D}_{t-1}$ to train $f_t^i(x; \theta)$ with the loss function $\mathcal{L}_t(\theta, \mathcal{D}_{t-1})$ (1) (setting $\theta_0$ as the initial parameters) using gradient descent till convergence, to obtain $\theta_t^i = \arg\min_\theta \mathcal{L}_t(\theta, \mathcal{D}_{t-1})$
8:     Choose $x_t^i = \arg\max_{x \in \mathcal{X}} f_t^i(x; \theta_t^i)$
9:   Query the batch $\{x_t^i\}_{i=1,\ldots,B}$ to yield the observations $\{y_t^i\}_{i=1,\ldots,B}$, and add them to $\mathcal{D}_{t-1}$

**Algorithm 2** STO-BNTS-Linear
1: for $t = 1, 2, \ldots, T/B$ do
2:   Construct NN $f(x; \theta)$ and multiply its output by $\beta_t$ (Theorem 1)
3:   for $i = 1, 2, \ldots, B$ do
4:     Sample $\theta_0' \sim \mathrm{init}(\cdot)$, and define $f_t^i(x; \theta) = \langle \nabla_\theta f(x; \theta_0'), \theta \rangle$
5:     Sample $\theta_0 \sim \mathrm{init}(\cdot)$
6:     Use observation history $\mathcal{D}_{t-1}$ to train $f_t^i(x; \theta)$ with the loss function $\mathcal{L}_t(\theta, \mathcal{D}_{t-1})$ (1) (setting $\theta_0$ as the initial parameters) using gradient descent till convergence, to obtain $\theta_t^i = \arg\min_\theta \mathcal{L}_t(\theta, \mathcal{D}_{t-1})$
7:     Choose the query $x_t^i = \arg\max_{x \in \mathcal{X}} f_t^i(x; \theta_t^i)$
8:   Query the batch $\{x_t^i\}_{i=1,\ldots,B}$ to yield the observations $\{y_t^i\}_{i=1,\ldots,B}$, and add them to $\mathcal{D}_{t-1}$

**STO-BNTS-Linear (Algo. 2).** Similar to STO-BNTS, at the beginning of iteration $t$, STO-BNTS-Linear firstly constructs an NN $f(x; \theta)$ and multiplies its output by $\beta_t$. To choose the $i$th query in iteration $t$, STO-BNTS-Linear firstly constructs a linear model $f_t^i(x; \theta)$ using the neural tangent features (i.e., $\nabla_\theta f(x; \theta_0')$ where $\theta_0' \sim \mathrm{init}(\cdot)$ are initial parameters) as the input features (line 4). Next, we sample $\theta_0 \sim \mathrm{init}(\cdot)$ (line 5) and use it as the initial parameters to train $f_t^i(x; \theta)$ using gradient descent with the same loss function (1) as STO-BNTS, to produce $\theta_t^i$ (line 6). After that, the $i$th query is selected by maximizing the acquisition function $f_t^i(x; \theta_t^i)$: $x_t^i = \arg\max_{x \in \mathcal{X}} f_t^i(x; \theta_t^i)$ (line 7).

Lines 4-6 of STO-BNTS-Linear can be interpreted as a sample-then-optimize method [42] using the neural tangent features as the input features (a concrete sketch of this inner step is given at the end of this section). As a result, same as STO-BNTS, if an infinite-width NN is used, the function $f_t^i(x; \theta_t^i)$ obtained after the training in line 6 is a sample from the GP posterior with the NTK as the kernel, $\mathcal{GP}(\mu_{t-1}(\cdot), \beta_t^2 \sigma_{t-1}^2(\cdot, \cdot))$ (Appendix A). Moreover, in contrast to STO-BNTS, if the NN is finite-width, the function $f_t^i(x; \theta_t^i)$ obtained after line 6 of Algo. 2 still corresponds to a sample from the GP posterior with the empirical NTK $\tilde{\Theta}(x, x') = \langle \nabla_\theta f(x; \theta_0'), \nabla_\theta f(x'; \theta_0') \rangle$ as the kernel. As a result, for infinite-width NNs, STO-BNTS-Linear and STO-BNTS enjoy the same sub-linear (under some conditions) upper bound on their cumulative regret (Sec. 4.1); for finite-width NNs, unlike STO-BNTS, STO-BNTS-Linear still enjoys a regret upper bound, which is sub-linear as long as the NN is wide enough (Sec. 4.2).

Although STO-BNTS-Linear (Algo. 2) has the advantage of a theoretically guaranteed convergence also for finite-width NNs (Sec. 4.2), we expect STO-BNTS (Algo. 1) to perform better in practice. This is because STO-BNTS explicitly trains an NN surrogate model in every iteration and is hence able to directly exploit the strong representation power of NNs to model the objective function $f$. In contrast, STO-BNTS-Linear derives its representation power entirely from the neural tangent features of the NTK. However, it has been shown [1, 69] that the neural tangent features of the NTK cannot completely realize the representation power of NNs. As a result, STO-BNTS is expected to be more competitive in practice, especially in problems where the strong representation power of NNs is essential for accurately modeling complex objective functions. We empirically verify this practical advantage of STO-BNTS in our experiments (Sec. 5.5).
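To make the sample-then-optimize step concrete, below is a minimal sketch (our illustration, not the authors' released code) of one inner step of STO-BNTS-Linear (lines 4-7 of Algo. 2) with a small finite-width NN. All sizes, the candidate set, and the history are hypothetical stand-ins, and we set $\beta_t = 1$ as in the experiments (footnote 2). The function `phi` returns the neural tangent features $\nabla_\theta f(x; \theta_0')$, whose inner products form the empirical NTK of Sec. 2.

```python
import torch

torch.manual_seed(0)
d, m, sigma2, beta_t = 4, 64, 0.01, 1.0   # illustrative sizes; beta_t = 1 as in the experiments

# NN f(x; theta) whose tangent features define the linear model (Algo. 2).
net = torch.nn.Sequential(
    torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1))
for p in net.parameters():                 # theta'_0 ~ init(.): standard Gaussian
    torch.nn.init.normal_(p)

def phi(x):
    """Neural tangent features grad_theta f(x; theta'_0), flattened into one vector."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, tuple(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Hypothetical history D_{t-1} and candidate inputs standing in for the domain X.
X_hist, y_hist = torch.randn(10, d), torch.randn(10)
X_cand = torch.randn(200, d)

Phi = torch.stack([phi(x) for x in X_hist])        # |D_{t-1}| x p feature matrix
p_dim = Phi.shape[1]
theta0 = torch.randn(p_dim)                        # theta_0 ~ init(.)
lam = beta_t ** 2 * sigma2                         # regularizer weight in loss (1)

# Gradient descent on loss (1), started from theta_0 (line 6 of Algo. 2). The loss
# is strongly convex for this linear model, so this approaches the unique minimizer
# theta_t^i without ever forming (let alone inverting) a p x p matrix.
theta = theta0.clone().requires_grad_(True)
opt = torch.optim.Adam([theta], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((y_hist - Phi @ theta) ** 2).sum() \
           + lam * ((theta - theta0) ** 2).sum()
    loss.backward()
    opt.step()

# Line 7 of Algo. 2: maximize the trained linear model over the candidates.
scores = torch.stack([phi(x) @ theta.detach() for x in X_cand])
x_query = X_cand[scores.argmax()]
```

The randomness of the resulting acquisition function enters only through $\theta_0'$ (which fixes the features) and $\theta_0$ (which fixes the prior anchor in the regularizer), which is what makes each inner step an independent posterior sample in the infinite-width limit.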
## 4 Theoretical Results

We firstly prove a regret upper bound for both STO-BNTS and STO-BNTS-Linear using infinite-width NNs, and show that both algorithms are asymptotically no-regret under certain conditions (Sec. 4.1). Next, we derive a regret upper bound for STO-BNTS-Linear when the NN is finite-width, and show that the regret upper bound remains sub-linear as long as the NN is wide enough (Sec. 4.2).

### 4.1 Infinite-width NNs

Here we use $\mathbf{y}_{1:t}$ to denote the output observations from iterations $1$ to $t$ (i.e., $tB$ observations in total) and use $\mathbf{y}_A$ to denote the vector of observations at a set of inputs $A \subset \mathcal{X}$. Theorem 1 (proof in Appendix C) gives a regret upper bound for both STO-BNTS and STO-BNTS-Linear:

**Theorem 1** (Infinite-width NNs). Assume that $f$ is sampled from a GP with the NTK $\Theta$ as the kernel function, that $|f(x)| \le B$ $\forall x \in \mathcal{X}$ for some $B > 0$, and that $\Theta(x, x') \le K_0$ $\forall x, x' \in \mathcal{X}$ for some $K_0 > 0$. Let $\delta \in (0, 1)$ and $\beta_t = \sqrt{2 \log\big(\pi^2 t^2 |\mathcal{X}| / (3\delta)\big)}$. Then with probability of $1 - \delta$,

$$R_T = \tilde{\mathcal{O}}\big(e^C \sqrt{T} (\sqrt{\gamma_T} + 1)\big),$$

where $\tilde{\mathcal{O}}$ ignores all logarithmic factors, $\gamma_T$ is the maximum information gain about $f$ from any set of $T$ observations, and $C$ is an absolute constant s.t. $\max_{A \subset \mathcal{X}, |A| \le B-1} I(f; \mathbf{y}_A \mid \mathbf{y}_{1:t}) \le C$ for all $t \ge 1$.

The assumption of $\Theta(x, x') \le K_0$ is natural and in line with [31], and $|f(x)| \le B$ is also a mild assumption since most practical objective functions have bounded values. Our primary assumption that $f$ is sampled from a GP with the NTK as the kernel function is a common assumption in the analysis of BO algorithms [17, 57]. Of note, compared with the assumptions from previous works using other commonly used kernels such as the squared exponential (SE) kernel [17, 57], our assumption on $f$ holds for more complicated objective functions. Specifically, the class of functions sampled from a GP with the NTK as the kernel subsumes more non-smooth and sophisticated objective functions compared with the other commonly used kernels.

As shown in [17], $C$ can be chosen to be an absolute constant that is independent of $B$ and $T$ as long as we initialize our algorithm using the uncertainty sampling method, which entails choosing the initial inputs by sequentially maximizing the GP posterior variance (more details in Appendix B; a toy sketch of this initialization is given below). Specifically, for any chosen constant $C > 0$, as long as we run the uncertainty sampling initialization stage for $T_{\mathrm{init}}$ iterations such that $(B-1)\gamma_{T_{\mathrm{init}}} / T_{\mathrm{init}} \le C$, then $\max_{A \subset \mathcal{X}, |A| \le B-1} I(f; \mathbf{y}_A \mid \mathbf{y}_{1:t}) \le C$, $\forall t \ge 1$ is guaranteed to be satisfied. Since it has been shown by the work of [31] that $\gamma_T = \tilde{\mathcal{O}}(T^{(d-1)/d})$ grows sub-linearly for the NTK,³ the ratio $(B-1)\gamma_{T_{\mathrm{init}}} / T_{\mathrm{init}}$ decreases as $T_{\mathrm{init}}$ increases. As a result, for any chosen $C$, we are able to choose a finite $T_{\mathrm{init}}$ such that the condition $(B-1)\gamma_{T_{\mathrm{init}}} / T_{\mathrm{init}} \le C$ is satisfied. For example, if we choose $C = 1$, then the required number of initial iterations is approximately $T_{\mathrm{init}} = \Theta((B-1)^d)$. As a result, the regret upper bound is a summation of $2B\,T_{\mathrm{init}}$ (i.e., the regret incurred during the initialization stage, because the regret at every step is upper-bounded by $2B$) and the regret upper bound from Theorem 1 with $C = 1$. Since both $B$ and $T_{\mathrm{init}}$ are constants independent of $T$ (assuming that the batch size $B$ is independent of $T$), the asymptotic regret upper bound simplifies to $R_T = \tilde{\mathcal{O}}(\sqrt{T}(1 + \sqrt{\gamma_T}))$. Plugging in $\gamma_T = \tilde{\mathcal{O}}(T^{(d-1)/d})$, the final regret upper bound becomes $R_T = \tilde{\mathcal{O}}(T^{1/2} + T^{(2d-1)/(2d)}) = \tilde{\mathcal{O}}(T^{(2d-1)/(2d)})$, which is sub-linear in $T$ and hence implies that our STO-BNTS and STO-BNTS-Linear are both asymptotically no-regret when the NN surrogate is infinite-width.
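The following is a hedged numpy sketch (our illustration, with assumed toy sizes) of the uncertainty sampling initialization: inputs are chosen by sequentially maximizing the GP posterior variance, which depends only on the previously queried inputs and not on their observed outputs. We use an SE kernel as a stand-in; with the NTK one would substitute its Gram function for `k`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))               # hypothetical finite domain
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # keep ||x||_2 <= 1
sigma2 = 0.01                                   # observation noise variance

def k(Xa, Xb, ls=0.5):
    """SE kernel as a stand-in for the NTK Gram function."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ls ** 2))

def posterior_var(X_all, X_obs):
    """GP posterior variance at X_all given queried inputs X_obs (no outputs needed)."""
    prior = np.ones(len(X_all))                 # k(x, x) = 1 for this SE kernel
    if len(X_obs) == 0:
        return prior
    K = k(X_obs, X_obs) + sigma2 * np.eye(len(X_obs))
    Kx = k(X_all, X_obs)                        # cross-covariances, N x n
    # var_i = k(x_i, x_i) - k_i^T K^{-1} k_i
    return prior - np.einsum("ij,ji->i", Kx, np.linalg.solve(K, Kx.T))

init_idx = []                                   # uncertainty sampling for T_init steps
for _ in range(20):
    var = posterior_var(X, X[init_idx])
    init_idx.append(int(np.argmax(var)))
```

The selected $T_{\mathrm{init}}$ inputs are then queried to form the initial history, which is where the constant $2B\,T_{\mathrm{init}}$ regret of the initialization stage is incurred.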
Moreover, when $T \gg B$ and $B$ is a constant independent of $T$, Theorem 1 gives us insights into the benefit of batch over sequential evaluations. In this case, the regret incurred during initialization (i.e., $\tilde{\mathcal{O}}((B-1)^d)$) is negligible. Therefore, our analysis above suggests that our algorithms with batch evaluations ($B > 1$) enjoy the same asymptotic regret upper bound as their sequential counterparts ($B = 1$), since the resulting regret upper bound of $R_T = \tilde{\mathcal{O}}(\sqrt{T}(1 + \sqrt{\gamma_T}))$ does not depend on the batch size $B$. This demonstrates the advantage of batch evaluations because when $B > 1$, some of our evaluations can be run in parallel, which is not supported by the sequential setting with $B = 1$. As a simple illustration, for a large $T$, both the sequential and batch settings achieve a simple regret of the order $\tilde{\mathcal{O}}(T^{-1/(2d)})$ after $T$ function evaluations. However, since our batch setting can evaluate every $B > 1$ selected inputs in parallel in every iteration, our batch setting with $B > 1$ achieves this simple regret after only $T/B$ iterations, which is only a fraction ($1/B$) of the $T$ iterations required by the sequential setting. This also shows that we enjoy more benefit with a larger batch size $B$.

³Note that in order to quote the results from [31], we need to follow their assumption that the domain $\mathcal{X}$ is a subset of the $d$-dimensional unit hypersphere: $\mathcal{X} \subset \{x \,|\, \|x\|_2 = 1\}$, which is stricter than our main assumption (Sec. 2) that $\mathcal{X}$ is a subset of the unit ball: $\mathcal{X} \subset \{x \,|\, \|x\|_2 \le 1\}$.

### 4.2 Finite-width NNs

Here we prove a regret upper bound for STO-BNTS-Linear with a finite-width NN. The main technical challenge in the proof lies in the disparity between the exact and empirical NTKs: the function $f$ is assumed to be sampled from a GP with the exact NTK $\Theta$, yet our acquisition function $f_t^i(x; \theta_t^i)$ (line 7 of Algo. 2) is obtained using the empirical NTK $\tilde{\Theta}$ when the NN is finite-width. To this end, we make use of the following theoretical guarantee on the approximation error between $\Theta$ and $\tilde{\Theta}$.

**Proposition 1** (Theorem 3.1 of [2]). Choose $\varepsilon > 0$ and $\delta \in (0, 1)$. If the width of every layer in an NN satisfies $m = \Omega\big(\frac{L^6}{\varepsilon^4} \log(4L|\mathcal{X}|^2/\delta)\big)$, then $\forall x, x' \in \mathcal{X}$, we have with probability $\ge 1 - \delta/4$ that

$$\big| \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle - \Theta(x, x') \big| \le (L+1)\varepsilon.$$

Proposition 1 ensures that we can reduce the upper bound $(L+1)\varepsilon$ on the approximation error between $\Theta$ and $\tilde{\Theta}$ by increasing the width $m$ of the NN. For example, let $m = C_{\mathrm{ntk}} L^6 \varepsilon^{-4} \log(4L|\mathcal{X}|^2/\delta)$ for some constant $C_{\mathrm{ntk}} > 0$; then the approximation quality of Proposition 1 can be expressed as $(L+1)\varepsilon = C_{\mathrm{ntk}}^{1/4} (L+1) L^{3/2} \log^{1/4}(4L|\mathcal{X}|^2/\delta)\, m^{-1/4}$, which decreases as the width $m$ increases. In our theoretical analysis, we assume that $(L+1)\varepsilon \le 1$, which can be satisfied as long as $m$ is large enough. The regret upper bound for STO-BNTS-Linear is given in Theorem 2 (proof in Appendix D):

**Theorem 2** (Finite-width NNs). Let $\delta \in (0, 1)$ and $\beta_t = \sqrt{2 \log\big(2\pi^2 t^2 |\mathcal{X}| / (3\delta)\big)}$. Then we have

$$R_T = \tilde{\mathcal{O}}\big(e^C \sqrt{T} (\sqrt{\gamma_T} + 1) + T^3 m^{-1/8} (L+1)^{5/4}\big)$$

with probability of $\ge 1 - \delta$. Here $\gamma_T$ and $C$ are the same as those defined in Theorem 1.

The first term in the regret upper bound in Theorem 2 is the same as that of Theorem 1 for infinite-width NNs and can hence be made sub-linear by using uncertainty sampling as the initialization stage: $R_T = \tilde{\mathcal{O}}(\sqrt{T}(1 + \sqrt{\gamma_T})) = \tilde{\mathcal{O}}(T^{(2d-1)/(2d)})$. The second term in the regret upper bound represents the additional regret incurred by the use of finite-width NNs, which can also be made sub-linear by choosing $m = \Omega(T^{24})$. In other words, if the width $m$ of the NN is chosen to be large enough (i.e., if $m = \Omega(T^{24})$), then the cumulative regret of STO-BNTS-Linear scales sub-linearly in $T$.
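As a quick sanity check of this width requirement (our own arithmetic, using the second term exactly as stated in Theorem 2), setting $m = T^{24}$ makes that term a constant:

$$T^3 m^{-1/8} (L+1)^{5/4} \Big|_{m = T^{24}} = T^3 \cdot T^{-3} \cdot (L+1)^{5/4} = (L+1)^{5/4} = \mathcal{O}(1),$$

so the bound is then dominated by its first, sub-linear term.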
### 4.3 Discussion

The assumption on the function $f$ in our theoretical analyses (Theorems 1 and 2) differs from those made by previous works on neural contextual bandits [66, 69]. Specifically, these previous works rely on the assumption of a positive definite NTK Gram matrix to approximate the value of $f$ (only evaluated at the observed contexts up to iteration $T$) using a function that is linear in the neural tangent features. When translated into our setting (which is equivalent to the contextual bandit setting where all contexts are fixed for all $t \ge 1$), the assumption of a positive definite NTK Gram matrix is violated as soon as any input $x$ is queried more than once, thus rendering this assumption unrealistic in our setting. In contrast, we have assumed that $f$ is sampled from a GP with the NTK as the kernel, which is a common assumption in the analysis of BO [17, 57] and allows us to derive a sub-linear regret upper bound. As a result of the different assumptions, our regret upper bounds are not directly comparable with those from the previous works on neural contextual bandits [66, 69].

## 5 Experiments

We compare our STO-BNTS and STO-BNTS-Linear with the baselines of Neural UCB [69] and Neural TS [66], as well as GP-UCB and GP-TS, which use GPs (with the SE kernel) instead of NNs as their surrogate models. The original implementations of Neural UCB [69] and Neural TS [66] are only applicable to discrete domains. So, for a fair comparison in the tasks with continuous domains, we have modified their implementations to maximize their acquisition functions in the same way as our methods (i.e., through a combination of random search and L-BFGS-B; refer to Appendix F for more details). We firstly explore some interesting insights about our algorithms using a synthetic experiment in Sec. 5.1. Next, we apply our algorithms to real-world AutoML (Sec. 5.2) and RL (Sec. 5.3) problems, as well as an optimization task over images (Sec. 5.4). Finally, we discuss some interesting insights from our experiments in Sec. 5.5. We plot the simple regret (or the best observation found up to an iteration) when presenting the experimental results, which is common practice in BO [8, 29]. We have deferred some experimental details to Appendix F due to space limitations. Our code is available at https://github.com/daizhongxiang/sto-bnts.

[Figure 1: (a-c) Acquisition functions of Deep Ensemble, STO-BNTS, and STO-BNTS-Linear (each panel shows the training set, the objective function, and the acquisition functions over the input range [0, 1]), and (d) simple regrets of GP-TS, STO-BNTS, Deep Ensemble, and several STO-BNTS-Linear/STO-BNTS variants in the synthetic experiment (Sec. 5.1).]

### 5.1 Synthetic Experiment

Here, we sample a smooth function from a GP with an SE kernel (with a length scale of 0.1), defined on a discrete 1-dimensional domain within the range of $[0, 1]$; a sketch of this construction is given below. For all methods, we use an NN architecture with a depth of $L = 8$ and a width of $m = 64$ unless specified otherwise.
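Such a synthetic objective can be drawn as follows; this is a small sketch under our own assumptions (the grid size and random seed are illustrative), not the experiment's exact script.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 200)     # discrete 1-D domain in [0, 1]
ls = 0.1                            # SE kernel length scale, as in Sec. 5.1
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2.0 * ls ** 2))
# Draw f ~ N(0, K); the jitter keeps the Cholesky factorization numerically stable.
L_chol = np.linalg.cholesky(K + 1e-6 * np.eye(len(xs)))
f = L_chol @ rng.standard_normal(len(xs))
```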
Figs. 1 (a-c) illustrate the acquisition functions $f_t^i(x; \theta_t^i)$ (line 8 of Algo. 1 and line 7 of Algo. 2) of different algorithms. The Deep Ensemble method [36] in Fig. 1a can be regarded as a reduced version of our STO-BNTS algorithm (Algo. 1) in which the term $\langle \nabla_\theta f(x; \theta_0), \theta_0' \rangle$ (i.e., the second term in line 6 of Algo. 1) is removed. As a result, Deep Ensemble does not enjoy the theoretical guarantees of our algorithms (Sec. 4). In Figs. 1 (a-c), given the same training set (red stars), every method constructs a batch of $B = 100$ acquisition functions. E.g., STO-BNTS repeats lines 4-7 of Algo. 1 independently for $B = 100$ times to produce the acquisition functions $f_t^i(x; \theta_t^i)$ for $i = 1, \ldots, 100$, which are plotted as the blue lines in Fig. 1b. Note that every acquisition function (blue line) is maximized to select an input query. The figures show that, compared with the naive baseline of Deep Ensemble, our STO-BNTS and STO-BNTS-Linear display more exploratory behavior in unexplored regions (e.g., the interval $[0.2, 0.4]$). This may be explained by our theoretical guarantees (Sec. 4), which imply that both of our algorithms perform exploration in a principled way and hence naturally handle the exploration-exploitation trade-off. Moreover, it has also been justified by [24] that the addition of the term $\langle \nabla_\theta f(x; \theta_0), \theta_0' \rangle$ improves the ability of the NN to characterize the uncertainty of its predictions, which corroborates our findings here.

The simple regrets of the different algorithms are plotted in Fig. 1d.⁴ The first interesting observation is that our STO-BNTS (orange) significantly outperforms Deep Ensemble (red), which corroborates the insight discussed above positing that the addition of the term $\langle \nabla_\theta f(x; \theta_0), \theta_0' \rangle$ leads to more principled exploration and hence better performance. Due to its lack of exploration, as illustrated in Fig. 1a, Deep Ensemble fails to reach zero regret in Fig. 1d. Furthermore, the discrepancy between the green and purple curves shows that batch evaluations ($B = 4$) lead to a significant performance improvement. Moreover, compared with the green curve for which $m = 64$, using a wider NN (gray curve, $m = 512$) substantially improves the performance of STO-BNTS-Linear, whereas employing a narrower NN (yellow curve, $m = 16$) significantly degrades it. These observations agree with Theorem 2, which states that a larger width $m$ reduces the regret of STO-BNTS-Linear. Similarly, the NN surrogate model of STO-BNTS should also be wide enough, since the use of a narrower NN (light blue curve, $m = 16$) also leads to worse performance for STO-BNTS. Lastly, GP-TS (blue) performs competitively in this experiment with a very smooth objective function. However, as we will show in the next two sections, in real-world experiments with more complicated objective functions, GP-based methods are consistently outperformed by our algorithms.

### 5.2 Real-world Experiments on Automated Machine Learning (AutoML)

Here, we adopt 3 hyperparameter tuning tasks. We use tasks involving categorical hyperparameters to highlight the advantage of our NN-based methods over GP-based methods. We use a diabetes diagnosis dataset to tune 6 categorical hyperparameters of random forest (RF), and then use the MNIST dataset to tune 9 hyperparameters (4 continuous and 5 categorical) of XGBoost and 9 hyperparameters (2 continuous and 7 categorical) of convolutional neural networks (CNNs). Fig. 2 plots the results for the RF (a,b) and XGBoost (c,d) tasks, and the results for CNN are shown in Fig. 4 (Appendix F.2).
⁴To show the benefit of batch evaluations, in all experiments (including the real-world experiments), we use the iteration $t$ as the horizontal axis, and in every iteration $t$, we report the largest $f(x_t^i)$ ($y_t^i$ in the real-world experiments) within a batch (when $B > 1$) as the observation for that iteration.

[Figure 2: Validation errors for hyperparameter tuning of RF (a,b) and XGBoost (c,d); panels (a,c) compare variants of our algorithms and Deep Ensemble, while panels (b,d) compare against GP-TS, GP-UCB, Neural TS, and Neural UCB. B = 1 by default.]

[Figure 3: Cumulative rewards for (a) 12-D Lunar-Lander, (b) 14-D robot pushing, and (c) 20-D rover trajectory planning, and (d) results for optimization over images in Sec. 5.4 (including a Random Search baseline). B = 1 unless specified otherwise.]

For each task, we firstly compare the performances of different variants of our algorithms in Figs. 2a, c and Fig. 4a, which demonstrate a number of interesting insights. Firstly, our STO-BNTS outperforms Deep Ensemble, which is consistent with the results in Fig. 1d and further emphasizes the practical significance of the improved exploration performed by our STO-BNTS (Sec. 5.1). Secondly, STO-BNTS-Linear is significantly outperformed by STO-BNTS in Figs. 2a and c, which can likely be attributed to its inability to fully leverage the representation power of NNs, as we have discussed in Sec. 3 (see more discussion in Sec. 5.5). Next, Figs. 2a, c and Fig. 4a also show that the performance of STO-BNTS tends to suffer when the NN surrogate model is either overly shallow ($L = 1$, $m = 512$, orange curves) or overly deep ($L = 8$, $m = 64$, red curves); that is, both the orange and red curves underperform significantly in two of the three experiments. This is likely because an overly shallow NN lacks the representation power to model complicated objective functions, whereas an overly deep NN may be prone to overfitting since the size of the training set here is much smaller than in typical deep learning applications. In contrast, the NN architecture of $L = 2$, $m = 256$ (blue curves) consistently performs well in all three experiments; therefore, we will use it as the default architecture in the experiments in the next section. Moreover, the benefit of batch evaluations can also be further confirmed by comparing the blue ($B = 1$) and green ($B = 4$) curves in Figs. 2a, c and Fig. 4a.
Figs. 2b, d and Fig. 4b show that for the same NN architecture of $L = 2$, $m = 256$, our STO-BNTS is the best-performing method: it performs better than all baselines in Figs. 2b and d and comparably with them in Fig. 4b. The undesirable performances of Neural UCB and Neural TS may be explained by the errors due to their diagonal matrix approximation (Sec. 1), and the underwhelming performances of GP-UCB and GP-TS may result from the ineffectiveness of GPs in modeling categorical inputs (more on this in Sec. 5.5).

### 5.3 Real-world Experiments on Reinforcement Learning (RL)

Here we optimize the control parameters of 3 RL problems: we tune $d = 12$ parameters of a heuristic controller for the Lunar-Lander task from OpenAI Gym [4], $d = 14$ parameters of a controller for a robot pushing task [62], and $d = 20$ parameters for a rover trajectory planning problem [62]. We use $L = 2$, $m = 256$ for all methods. The results for these three RL tasks are plotted in Figs. 3a, b and c. Our STO-BNTS and STO-BNTS-Linear (purple and yellow curves) consistently perform the best among all methods with sequential evaluations ($B = 1$), and our methods with batch evaluations ($B = 4$, gray and light blue curves) achieve further performance improvements over their sequential counterparts. Of note, despite underperforming in Sec. 5.2, STO-BNTS-Linear achieves comparable performances with STO-BNTS in all three experiments here, for both sequential and batch evaluations, outperforming the other baselines. The inefficacy of GP-TS and GP-UCB here may result from the inability of GPs to effectively model high-dimensional input spaces [30] (Sec. 5.5).

### 5.4 Optimization over Images

Some real-world applications require optimizing over an input domain of images. For example, an image recommender system sequentially recommends different images to a user in order to select the image with the best rating/score [65]. In these applications, GP-based BO methods require sophisticated techniques such as convolutional kernels in order to model image inputs [59]; in contrast, our methods can be easily applied by simply replacing the NN surrogate model with a CNN. Here, we simulate these applications by maximizing a score over the domain of images from the MNIST dataset. For all CNN-based methods, we use a CNN with a convolutional layer followed by a fully connected layer (both with $m = 64$) as the surrogate model (more details in Appendix F.4). The results (Fig. 3d) show that our STO-BNTS performs the best. Moreover, when using the same CNN architecture, STO-BNTS-Linear fails to achieve performances comparable to the other CNN-based methods, since STO-BNTS-Linear is not able to fully leverage the representation power of the CNN (Sec. 3). However, again consistent with Theorem 2, increasing the width of the CNN improves the performance of STO-BNTS-Linear to be comparable with STO-BNTS (Fig. 6 in Appendix F.4).

### 5.5 Discussion

Secs. 5.2, 5.3 and 5.4 show that our algorithms, although without any special design for a specific type of problem (e.g., problems involving categorical, high-dimensional, or image inputs), are competitive in all these problems thanks to the ability of NNs to model complicated real-world functions. Compared with GP-based BO methods, our algorithms may incur more computation due to the need to train an NN surrogate model. However, the additional computation can easily be overshadowed by the cost of function evaluations, since BO is usually used to optimize expensive-to-compute functions.

**STO-BNTS vs. STO-BNTS-Linear.**
Our experimental results have demonstrated some interesting insights from the comparison between our STO-BNTS (Algo. 1) and STO-BNTS-Linear (Algo. 2). As we have discussed in Sec. 3, since STO-BNTS-Linear is unable to explicitly take advantage of the representation power of NNs, it is expected to be less competitive than STO-BNTS in practice, especially in problems where the strong representation power of NNs is crucial for accurately modeling the objective function. Examples of such problems include those with categorical (Sec. 5.2) or image (Sec. 5.4) inputs. Interestingly, our results in Secs. 5.2 and 5.4 indeed corroborate this insight by showing that STO-BNTS-Linear is consistently outperformed by STO-BNTS in these experiments. However, note that in such problems, the performance of STO-BNTS-Linear can be significantly improved by further increasing the width of the NN (CNN), as we have shown in Fig. 6. In addition, the practical efficacy of STO-BNTS-Linear can also be seen from the experiments in Sec. 5.3, in which STO-BNTS-Linear performs comparably with STO-BNTS and consistently outperforms all other baselines. This demonstrates the empirical competence of STO-BNTS-Linear in some real-world problems such as RL. Moreover, compared with STO-BNTS, STO-BNTS-Linear enjoys the theoretical advantage of a guaranteed convergence for finite-width NNs (Sec. 4.2).

**Baseline Methods.** The underwhelming performances of Neural UCB and Neural TS in our experiments are likely caused by the errors introduced by the diagonal matrix approximation that is used to avoid the inversion of the $p \times p$ matrix (Sec. 1). The inadequate performances of GP-UCB and GP-TS may be explained by the ineffectiveness of GPs in modeling objective functions involving categorical [15, 18] or high-dimensional inputs [30, 43], which correspond to the experiments in Secs. 5.2 and 5.3, respectively. Nevertheless, we expect GP-based methods to achieve better performances (than those reported here) in problems with lower-dimensional and purely continuous input spaces (e.g., GP-TS performs competitively in the synthetic experiment, as shown in Fig. 1d).

**Depth L and Width m of the NN Surrogate Model.** Our experimental results provide some guidelines on the choice of the depth $L$ and width $m$ of the NN surrogate model in our algorithms. Regarding the depth $L$, our experiments in Sec. 5.2 have shown that an overly shallow NN usually hurts the performance due to its lack of expressive power, whereas an excessively deep NN is also likely to deteriorate the performance due to overfitting. Therefore, we discourage the use of NNs that are either exceedingly shallow or overly deep, and recommend shallower NNs for simpler tasks (to prevent overfitting) and deeper NNs for more complicated tasks (to gain enough representation power). The width $m$ should be chosen to be large enough, since our experiments in Sec. 5.1 (Fig. 1d) and Sec. 5.4 (Fig. 6 in Appendix F.4) suggest that a larger width usually improves the performance. Moreover, we have shown that the choice of $L = 2$, $m = 256$ consistently leads to competitive performances in a wide range of experiments (i.e., all experiments in Secs. 5.2 and 5.3). Therefore, we recommend it as the default choice for real-world problems.
## 6 Related Works

The NTK provides a theoretical tool to study the training dynamics of NNs by drawing connections with kernel methods [2, 5, 13, 27, 38], and it has been successfully applied to a number of practical problems such as neural architecture search [52, 54], active learning [61], and data valuation [63], among others. The pioneering work of [69] exploited the theory of the NTK to introduce the Neural UCB algorithm for contextual bandits, which uses an NN to learn the reward (objective) function and leverages neural tangent features for exploration following the UCB principle. Neural UCB has been followed up by other works that mainly aimed to make it more practical: [64] proposed to improve the computational efficiency of Neural UCB through the additional assumption that the reward function is linear in the feature mapping of the last layer of the NN, [23] also aimed to reduce the computational cost of Neural UCB by limiting the number of updates of the NN surrogate model, and [44] proposed a method to reduce the memory requirement of generic neural bandit algorithms. The recent work of [41] performed an empirical study of neural bandit algorithms and discovered that they tend to perform competitively in problems that require learning complicated representations. Above all, the work most closely related to our paper is [66], which introduced the Neural TS algorithm. Following similar principles as Neural UCB, Neural TS learns the reward function using an NN surrogate model and constructs the exploration term through the neural tangent features; these are used as, respectively, the mean and variance of the Gaussian distribution from which the reward is sampled in the TS routine. GP-based BO methods [3, 33, 48, 49] have achieved impressive performances in recent years and have been extended to various problem settings such as high-dimensional BO [19, 26, 30], multi-fidelity BO [14, 28, 67, 68], meta-BO [10, 50], risk-averse BO [46, 47, 58], multi-agent/collaborative BO [9, 56], and non-myopic BO [34, 40], among others. More importantly, they have also been extended to the batch setting, based on either GP-UCB [8, 16, 17, 32, 60] or GP-TS [25, 29, 45, 60].

## 7 Conclusion

We propose STO-BNTS and STO-BNTS-Linear, both of which sidestep the requirement to invert a large parameter matrix in existing neural bandit algorithms and naturally support batch evaluations while preserving their theoretical guarantees. Both algorithms are asymptotically no-regret under certain conditions if the NN surrogate is infinite-width, and STO-BNTS-Linear still enjoys sub-linear regret for a finite-width NN if it is wide enough. A potential limitation is that our theoretical analysis for finite-width NNs (Sec. 4.2) only holds for STO-BNTS-Linear but not for STO-BNTS, which we will explore in future works. Another promising future topic is to leverage our ability to exploit the strong representation power of NNs to apply our algorithms to other challenging optimization tasks with sophisticated search spaces, such as chemical design [22, 35] and neural architecture search [53, 55]. Moreover, it is also an interesting future direction to extend our algorithms to handle other problem settings which have been considered by BO (Sec. 6), e.g., multi-fidelity optimization [67, 68] and risk aversion/fault tolerance [20, 46, 47]. A potential negative societal impact is that our work may promote wider adoption of deep learning methods and hence cause more electricity consumption.
## Acknowledgments and Disclosure of Funding

This research/project is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund - Pre-Positioning (IAF-PP) (Award A19E4a0101).

## References

[1] Z. Allen-Zhu and Y. Li. What can ResNet learn efficiently, going beyond kernels? In Proc. NeurIPS, 2019.
[2] S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. arXiv:1904.11955, 2019.
[3] S. Balakrishnan, Q. P. Nguyen, B. K. H. Low, and H. Soh. Efficient exploration of reward functions in inverse reinforcement learning via Bayesian optimization. In Proc. NeurIPS, pages 4187-4198, 2020.
[4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
[5] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Proc. NeurIPS, volume 32, pages 10836-10846, 2019.
[6] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794, 2016.
[7] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proc. ICML, pages 844-853, 2017.
[8] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer, 2013.
[9] Z. Dai, Y. Chen, B. K. H. Low, P. Jaillet, and T.-H. Ho. R2-B2: Recursive reasoning-based Bayesian optimization for no-regret learning in games. In Proc. ICML, pages 2291-2301, 2020.
[10] Z. Dai, Y. Chen, H. Yu, B. K. H. Low, and P. Jaillet. On provably robust meta-Bayesian optimization. In Proc. UAI, 2022.
[11] Z. Dai, B. K. H. Low, and P. Jaillet. Federated Bayesian optimization via Thompson sampling. In Proc. NeurIPS, pages 9687-9699, 2020.
[12] Z. Dai, B. K. H. Low, and P. Jaillet. Differentially private federated Bayesian optimization with distributed exploration. In Proc. NeurIPS, pages 9125-9139, 2021.
[13] Z. Dai, Y. Shu, A. Verma, F. X. Fan, B. K. H. Low, and P. Jaillet. Federated neural bandit. arXiv:2205.14309, 2022.
[14] Z. Dai, H. Yu, B. K. H. Low, and P. Jaillet. Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pages 1496-1506, 2019.
[15] E. Daxberger, A. Makarova, M. Turchetta, and A. Krause. Mixed-variable Bayesian optimization. In Proc. IJCAI, 2020.
[16] E. A. Daxberger and B. K. H. Low. Distributed batch Gaussian process optimization. In Proc. ICML, pages 951-960, 2017.
[17] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15:3873-3923, 2014.
[18] A. Deshwal, S. Belakaria, and J. R. Doppa. Bayesian optimization over hybrid spaces. In Proc. ICML, 2021.
[19] D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local Bayesian optimization. In Proc. NeurIPS, volume 32, pages 5496-5507, 2019.
[20] F. X. Fan, Y. Ma, Z. Dai, W. Jing, C. Tan, and B. K. H. Low. Fault-tolerant federated reinforcement learning with theoretical guarantee. In Proc. NeurIPS, 2021.
[21] P. I. Frazier. A tutorial on Bayesian optimization. arXiv:1807.02811, 2018.
[22] R.-R. Griffiths and J. M. Hernández-Lobato. Constrained Bayesian optimization for automatic chemical design using variational autoencoders.
Chemical Science, 11(2):577-586, 2020.
[23] Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou. Batched neural bandits. arXiv:2102.13028, 2021.
[24] B. He, B. Lakshminarayanan, and Y. W. Teh. Bayesian deep ensembles via the neural tangent kernel. In Proc. NeurIPS, 2020.
[25] J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proc. ICML, pages 1470-1479. PMLR, 2017.
[26] T. N. Hoang, Q. M. Hoang, and B. K. H. Low. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pages 3231-3238, 2018.
[27] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proc. NeurIPS, 2018.
[28] K. Kandasamy, G. Dasarathy, J. Schneider, and B. Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proc. ICML, pages 1799-1808. PMLR, 2017.
[29] K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos. Parallelised Bayesian optimisation via Thompson sampling. In Proc. AISTATS, pages 133-142. PMLR, 2018.
[30] K. Kandasamy, J. Schneider, and B. Póczos. High dimensional Bayesian optimisation and bandits via additive models. In Proc. ICML, pages 295-304. PMLR, 2015.
[31] P. Kassraie and A. Krause. Neural contextual bandits without regret. arXiv:2107.03144, 2021.
[32] T. Kathuria, A. Deshpande, and P. Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In Proc. NeurIPS, volume 29, pages 4206-4214, 2016.
[33] D. Kharkovskii, Z. Dai, and B. K. H. Low. Private outsourced Bayesian optimization. In Proc. ICML, pages 5231-5242, 2020.
[34] D. Kharkovskii, C. K. Ling, and B. K. H. Low. Nonmyopic Gaussian process optimization with macro-actions. In Proc. AISTATS, pages 4593-4604, 2020.
[35] K. Korovina, S. Xu, K. Kandasamy, W. Neiswanger, B. Poczos, J. Schneider, and E. Xing. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. In Proc. AISTATS, pages 3393-3403. PMLR, 2020.
[36] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proc. NeurIPS, 2017.
[37] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[38] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Proc. NeurIPS, volume 32, pages 8572-8583, 2019.
[39] Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In Proc. AISTATS, pages 92-107, 2022.
[40] C. K. Ling, B. K. H. Low, and P. Jaillet. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pages 1860-1866, 2016.
[41] M. Lisicki, A. Afkanpour, and G. W. Taylor. An empirical study of neural kernel bandits. In NeurIPS Workshop on Bayesian Deep Learning, 2021.
[42] A. G. d. G. Matthews, J. Hron, R. E. Turner, and Z. Ghahramani. Sample-then-optimize posterior sampling for Bayesian linear models. In NeurIPS Workshop on Advances in Approximate Bayesian Inference, 2017.
[43] M. Mutný and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Proc. NeurIPS, pages 9005-9016. Curran, 2019.
[44] O. Nabati, T. Zahavy, and S. Mannor. Online limited memory neural-linear bandits with likelihood matching.
In Proc. ICML, 2021.
[45] E. Nava, M. Mutny, and A. Krause. Diversified sampling for batched Bayesian optimization with determinantal point processes. In Proc. AISTATS, pages 7031-7054. PMLR, 2022.
[46] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Optimizing conditional value-at-risk of black-box functions. In Proc. NeurIPS, pages 4170-4180, 2021.
[47] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Value-at-risk optimization with Gaussian processes. In Proc. ICML, pages 8063-8072, 2021.
[48] Q. P. Nguyen, S. Tay, B. K. H. Low, and P. Jaillet. Top-k ranking Bayesian optimization. In Proc. AAAI, pages 9135-9143, 2021.
[49] Q. P. Nguyen, Z. Wu, B. K. H. Low, and P. Jaillet. Trusted-maximizers entropy search for efficient Bayesian optimization. In Proc. UAI, pages 1486-1495, 2021.
[50] D. Salinas, H. Shen, and V. Perrone. A quantile-based approach for hyperparameter transfer learning. In Proc. ICML, pages 8438-8448. PMLR, 2020.
[51] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.
[52] Y. Shu, S. Cai, Z. Dai, B. C. Ooi, and B. K. H. Low. NASI: Label- and data-agnostic neural architecture search at initialization. In Proc. ICLR, 2022.
[53] Y. Shu, Y. Chen, Z. Dai, and B. K. H. Low. Neural ensemble search via Bayesian sampling. In Proc. UAI, 2022.
[54] Y. Shu, Z. Dai, Z. Wu, and B. K. H. Low. Unifying and boosting gradient-based training-free neural architecture search. In Proc. NeurIPS, 2022.
[55] Y. Shu, W. Wang, and S. Cai. Understanding architectures learnt by cell-based neural architecture search. In Proc. ICLR, 2020.
[56] R. H. L. Sim, Y. Zhang, B. K. H. Low, and P. Jaillet. Collaborative Bayesian optimization with fair regret. In Proc. ICML, pages 9691-9701, 2021.
[57] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pages 1015-1022, 2010.
[58] S. S. Tay, C. S. Foo, D. Urano, R. C. X. Leong, and B. K. H. Low. Efficient distributionally robust Bayesian optimization with worst-case sensitivity. In Proc. ICML, 2022.
[59] M. Van der Wilk, C. E. Rasmussen, and J. Hensman. Convolutional Gaussian processes. In Proc. NeurIPS, 2017.
[60] A. Verma, Z. Dai, and B. K. H. Low. Bayesian optimization under stochastic delayed feedback. In Proc. ICML, pages 22145-22167. PMLR, 2022.
[61] Z. Wang, P. Awasthi, C. Dann, A. Sekhari, and C. Gentile. Neural active learning with performance guarantees. Proc. NeurIPS, 34:7510-7521, 2021.
[62] Z. Wang, C. Gehring, P. Kohli, and S. Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In Proc. AISTATS, pages 745-754. PMLR, 2018.
[63] Z. Wu, Y. Shu, and B. K. H. Low. DAVINZ: Data valuation using deep neural networks at initialization. In Proc. ICML, pages 24150-24176. PMLR, 2022.
[64] P. Xu, Z. Wen, H. Zhao, and Q. Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv:2012.01780, 2020.
[65] S. Zhang, L. Yao, A. Sun, and Y. Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1-38, 2019.
[66] W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural Thompson sampling. In Proc. ICLR, 2021.
[67] Y. Zhang, Z. Dai, and B. K. H. Low. Bayesian optimization with binary auxiliary information. In Proc. UAI, pages 1222-1232, 2019.
[68] Y. Zhang, T. N. Hoang, B. K. H. Low, and M. Kankanhalli.
Information-based multi-fidelity Bayesian optimization. In Proc. NIPS Workshop on Bayesian Optimization, 2017.
[69] D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with UCB-based exploration. In Proc. ICML, pages 11492-11502. PMLR, 2020.